Tuesday, August 14, 2018

Embedded builtins

V8 built-in functions (builtins) consume memory in every instance of V8. The builtin count, average size, and the number of V8 instances per Chrome browser tab have been growing significantly. This blog post describes how we reduced the median V8 heap size per website by 19% over the past year.

Background

V8 ships with an extensive library of JavaScript (JS) built-in functions. Many builtins are directly exposed to JS developers as functions installed on JS built-in objects, such as RegExp.prototype.exec and Array.prototype.sort; other builtins implement various internal functionality. Machine code for builtins is generated by V8’s own compiler, and is loaded onto the managed heap state for every V8 Isolate upon initialization. An Isolate represents an isolated instance of the V8 engine, and every browser tab in Chrome contains at least one Isolate. Every Isolate has its own managed heap, and thus its own copy of all builtins.

Back in 2015, builtins were mostly implemented in self-hosted JS, native assembly, or in C++. They were fairly small, and creating a copy for every Isolate was less problematic.

Much has changed in this space over the last few years.

In 2016, V8 began experimenting with builtins implemented in CodeStubAssembler (CSA). This turned out both to be convenient (platform-independent, readable) and to produce efficient code, so CSA builtins became ubiquitous. For a variety of reasons, CSA builtins tend to produce larger code, and the size of V8 builtins roughly tripled as more and more were ported to CSA. By mid-2017, their per-Isolate overhead had grown significantly and we started thinking about a systematic solution.

V8 snapshot size (including builtins) from 2015 until 2017

In late 2017, we implemented lazy builtin (and bytecode handler) deserialization as a first step. Our initial analysis showed that most sites used less than half of all builtins. With lazy deserialization, builtins are loaded on demand, and unused builtins are never loaded into the Isolate. Lazy deserialization shipped in Chrome 64 with promising memory savings. But builtin memory overhead was still linear in the number of Isolates.

Then, Spectre was disclosed, and Chrome ultimately turned on site isolation to mitigate its effects. Site isolation limits a Chrome renderer process to documents from a single origin. Thus, with site isolation, a typical browsing session creates more renderer processes and more V8 Isolates. Even though managing per-Isolate overhead has always been important, site isolation has made it even more so.

Embedded builtins

Our goal for this project was to completely eliminate per-Isolate builtin overhead.

The idea behind it was simple. Conceptually, builtins are identical across Isolates, and are only bound to an Isolate because of implementation details. If we could make builtins truly isolate-independent, we could keep a single copy in memory and share them across all Isolates. And if we could make them process-independent, they could even be shared across processes.

In practice, we faced several challenges. Generated builtin code was neither isolate- nor process-independent due to embedded pointers to isolate- and process-specific data. V8 had no concept of executing generated code located outside the managed heap. Builtins had to be shared across processes, ideally by reusing existing OS mechanisms. And finally (this turned out to be the long tail), performance must not noticeably regress.

The following sections describe our solution in detail.

Isolate- and process-independent code

Builtins are generated by V8’s internal compiler pipeline, which embeds references to heap constants (located on the Isolate’s managed heap), call targets (Code objects, also on the managed heap), and isolate- and process-specific addresses (e.g. C runtime functions or a pointer to the Isolate itself, so-called ‘external references’) directly into the code. In x64 assembly, a load of such an object could look as follows:

// Load an embedded address into register rbx.
REX.W movq rbx,0x56526afd0f70

V8 has a moving garbage collector, and the location of the target object could change over time. Should the target be moved during collection, the GC updates the generated code to point at the new location.

On x64 (and most other architectures), calls to other Code objects use an efficient call instruction which specifies the call target by an offset from the current program counter (an interesting detail: V8 reserves its entire CODE_SPACE on the managed heap at startup to ensure all possible Code objects remain within an addressable offset of each other). The relevant part of the calling sequence looks like this:

// Call instruction located at [pc + <offset>].
call <offset>
A pc-relative call

Code objects themselves live on the managed heap and are movable. When they are moved, the GC updates the offset at all relevant call sites.

In order to share builtins across processes, generated code must be immutable as well as isolate- and process-independent. Neither of the instruction sequences above fulfills that requirement: they embed addresses directly in the code, and the GC patches them at runtime.

To address both issues, we introduced an indirection through a dedicated, so-called root register, which holds a pointer into a known location within the current Isolate.

Isolate layout

V8’s Isolate class contains the roots table, which itself contains pointers to root objects on the managed heap. The root register permanently holds the address of the roots table.

The new, isolate- and process-independent way to load a root object thus becomes:

// Load the constant address located at the given
// offset from roots.
REX.W movq rax,[kRootRegister + <offset>]

Root heap constants can be loaded directly from the roots list as above. Other heap constants use an additional indirection through a global builtins constant pool, itself stored on the roots list:

// Load the builtins constant pool, then the
// desired constant.
REX.W movq rax,[kRootRegister + <offset>]
REX.W movq rax,[rax + 0x1d7]

For Code targets, we initially switched to a more involved calling sequence which loads the target Code object from the global builtins constant pool as above, loads the target address into a register, and finally performs an indirect call.

With these changes, generated code became isolate- and process-independent and we could begin working on sharing it between processes.

Sharing across processes

We initially evaluated two alternatives. Builtins could either be shared by mmap-ing a data blob file into memory; or, they could be embedded directly into the binary. We took the latter approach since it had the advantage that we would automatically reuse standard OS mechanisms to share memory across processes, and the change would not require additional logic by V8 embedders such as Chrome. We were confident in this approach since Dart’s AOT compilation had already successfully binary-embedded generated code.

An executable binary file is split into several sections. For example, an ELF binary contains data in the .data (initialized data), .rodata (initialized read-only data), and .bss (uninitialized data) sections, while native executable code is placed in .text. Our goal was to pack the builtins code into the .text section alongside native code.

Sections of an executable binary file

This was done by introducing a new build step that uses V8’s internal compiler pipeline to generate native code for all builtins and write their contents into embedded.cc. That file is then compiled into the final V8 binary.

The (simplified) V8 embedded build process

The embedded.cc file itself contains both metadata and generated builtins machine code as a series of .byte directives that instruct the C++ compiler (in our case, clang or gcc) to place the specified byte sequence directly into the output object file (and later the executable).

// Information about embedded builtins is included in
// a metadata table.
V8_EMBEDDED_TEXT_HEADER(v8_Default_embedded_blob_)
__asm__(".byte 0x65,0x6d,0xcd,0x37,0xa8,0x1b,0x25,0x7e\n"
[snip metadata]

// Followed by the generated machine code.
__asm__(V8_ASM_LABEL("Builtins_RecordWrite"));
__asm__(".byte 0x55,0x48,0x89,0xe5,0x6a,0x18,0x48,0x83\n"
[snip builtins code]

Contents of the .text section are mapped into read-only executable memory at runtime, and the operating system shares that memory across processes as long as the section contains only position-independent code without relocatable symbols. This is exactly what we wanted.

But V8’s Code objects consist not only of the instruction stream; they also carry various pieces of (sometimes isolate-dependent) metadata. A run-of-the-mill Code object packs both the metadata and the instruction stream into a single variable-sized object on the managed heap.

On-heap Code object layout

As we’ve seen, embedded builtins have their native instruction stream located outside the managed heap, embedded into the .text section. To preserve their metadata, each embedded builtin also has a small associated Code object on the managed heap, called the off-heap trampoline. Metadata is stored on the trampoline as for standard Code objects, while the inlined instruction stream simply contains a short sequence which loads the address of the embedded instructions and jumps there.

Off-heap Code object layout

The trampoline allows V8 to handle all Code objects uniformly. For most purposes, it is irrelevant whether the given Code object refers to standard code on the managed heap or to an embedded builtin.

Optimizing for performance

With the solution described in previous sections, embedded builtins were essentially feature-complete, but benchmarks showed that they came with significant slowdowns. For instance, our initial solution regressed Speedometer 2.0 by more than 5% overall.

We began hunting for optimization opportunities and identified the major sources of slowdown. Generated code was slower due to the frequent indirections needed to access isolate- and process-dependent objects: root constants were loaded from the roots list (1 indirection), other heap constants from the global builtins constant pool (2 indirections), and external references additionally had to be unpacked from within a heap object (3 indirections). The worst offender was our new calling sequence, which had to load the trampoline Code object and call it, only to then jump to the target address. Finally, calls between the managed heap and binary-embedded code appeared to be inherently slower, possibly because the long jump distance interferes with the CPU’s branch prediction.

Our work thus concentrated on 1. reducing indirections, and 2. improving the builtin calling sequence. To address the former, we altered the Isolate object layout to turn most object loads into a single root-relative load. The global builtins constant pool still exists, but only contains infrequently-accessed objects.

Optimized Isolate layout

Calling sequences were significantly improved on two fronts. Builtin-to-builtin calls were converted into a single pc-relative call instruction. This was not possible for runtime-generated JIT code since the pc-relative offset could exceed the maximal 32-bit value. There, we inlined the off-heap trampoline into all call sites, reducing the calling sequence from 6 to just 2 instructions.

With these optimizations, we were able to limit regressions on Speedometer 2.0 to roughly 0.5%.

Results

We evaluated the impact of embedded builtins on x64 over the top 10k most popular websites, and compared against both lazy and eager deserialization (described above).

V8 heap size reduction vs. eager and lazy deserialization

Whereas previously Chrome would ship with a memory-mapped snapshot that we’d deserialize on each Isolate, now the snapshot is replaced by embedded builtins that are still memory-mapped but do not need to be deserialized. The cost for builtins used to be c*(1 + n), where n is the number of Isolates and c is the memory cost of all builtins; now it’s just c*1 (in practice, a small amount of per-Isolate overhead also remains for the off-heap trampolines).
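To put that formula in perspective, here is a rough back-of-the-envelope calculation. The 700 KB per-Isolate figure is the one quoted in the v6.9 release notes below, and the Isolate count is made up purely for illustration:

// Rough illustration of the builtins memory cost before and after embedding.
const c = 700;                      // KB of builtins code (assumed figure)
const before = (n) => c * (1 + n);  // snapshot + one deserialized copy per Isolate
const after = (n) => c;             // one shared embedded copy, regardless of n

console.log(before(4)); // → 3500 (KB) with 4 Isolates
console.log(after(4));  // → 700 (KB), plus a small per-Isolate trampoline overhead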

Compared against eager deserialization, we reduced the median V8 heap size by 19%. The median Chrome renderer process size per site has decreased by 4%. In absolute numbers, the 50th percentile saves 1.9 MB, the 30th percentile saves 3.4 MB, and the 10th percentile saves 6.5 MB per site.

Significant additional memory savings are expected once bytecode handlers are also binary-embedded.

Embedded builtins are rolling out on x64 in Chrome 69, and mobile platforms will follow in Chrome 70. Support for ia32 is expected to be released in late 2018.

Note: All diagrams were generated using Vyacheslav Egorov’s awesome Shaky Diagramming tool.

Posted by Jakob Gruber (@schuay)

Tuesday, August 7, 2018

V8 release v6.9

Every six weeks, we create a new branch of V8 as part of our release process. Each version is branched from V8’s Git master immediately before a Chrome Beta milestone. Today we’re pleased to announce our newest branch, V8 version 6.9, which is in beta until its release in coordination with Chrome 69 Stable in several weeks. V8 v6.9 is filled with all sorts of developer-facing goodies. This post provides a preview of some of the highlights in anticipation of the release.

Memory savings through embedded built-ins

V8 ships with an extensive library of built-in functions. Examples are methods on built-in objects such as Array.prototype.sort and RegExp.prototype.exec, but also a wide range of internal functionality. Because their generation takes a long time, built-in functions are compiled at build-time and serialized into a snapshot, which is later deserialized at runtime to create the initial JavaScript heap state.

Built-in functions currently consume 700 KB in each Isolate (an Isolate roughly corresponds to a browser tab in Chrome). This is quite wasteful, and last year we began working on reducing this overhead. In V8 v6.4, we shipped lazy deserialization, ensuring that each Isolate only pays for the built-ins that it actually needs (but each Isolate still had its own copy).

Embedded built-ins go one step further. An embedded built-in is shared by all Isolates, and embedded into the binary itself instead of copied onto the JavaScript heap. This means that built-ins exist in memory only once regardless of how many Isolates are running, an especially useful property now that Site Isolation has been enabled by default. With embedded built-ins, we’ve seen a median 9% reduction of the V8 heap size over the top 10k websites on x64. Of these sites, 50% save at least 1.2 MB, 30% save at least 2.1 MB, and 10% save 3.7 MB or more.

V8 v6.9 ships with support for embedded built-ins on x64 platforms. Other platforms will follow soon in upcoming releases. Expect more details soon in a dedicated blog post.

Performance


Liftoff, WebAssembly’s new first-tier compiler

WebAssembly got a new baseline compiler for much faster startup of complex websites with big WebAssembly modules (such as Google Earth and AutoCAD). Depending on the hardware we are seeing speedups of more than 10×. Stay tuned for more details in a separate blog post.

Faster DataView operations

DataView methods have been reimplemented in V8 Torque, which spares a costly call to C++ compared to the former runtime implementation. Moreover, we now inline calls to DataView methods when compiling JavaScript code in TurboFan, resulting in even better peak performance for hot code. Using DataViews is now as efficient as using TypedArrays, finally making DataViews a viable choice in performance-critical situations. We’ll be covering this in more detail in an upcoming blog post about DataViews, so stay tuned!
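For example, here is the kind of code that benefits (a small illustrative sketch, not taken from our benchmarks): reading mixed-endian fields out of a single ArrayBuffer, which is exactly what DataView is for.

// Read a little-endian uint32 followed by a big-endian float64 from one buffer.
const buffer = new ArrayBuffer(12);
const view = new DataView(buffer);
view.setUint32(0, 0xcafebabe, /* littleEndian */ true);
view.setFloat64(4, Math.PI); // big-endian by default

console.log(view.getUint32(0, true).toString(16)); // → 'cafebabe'
console.log(view.getFloat64(4));                   // → 3.141592653589793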

Faster processing of WeakMaps during garbage collection

V8 v6.9 reduces Mark-Compact garbage collection pause times by improving WeakMap processing. Concurrent and incremental marking are now able to process WeakMaps, whereas previously all this work was done in the final atomic pause of Mark-Compact GC. Since not all work can be moved outside of the pause, the GC now also does more work in parallel to further reduce pause times. These optimizations essentially halved the average pause time for Mark-Compact GCs in the Web Tooling Benchmark.

WeakMap processing uses a fixed-point iteration algorithm that can degrade to quadratic runtime behavior in certain cases. With the new release, V8 is now able to switch to another algorithm that is guaranteed to finish in linear time if the GC does not finish within a certain number of iterations. Previously, worst-case examples could be constructed that took the GC a few seconds to finish even with a relatively small heap, while the linear algorithm finishes within a few milliseconds.

V8 API

Please use git log branch-heads/6.8..branch-heads/6.9 include/v8.h to get a list of the API changes.

Developers with an active V8 checkout can use git checkout -b 6.9 -t branch-heads/6.9 to experiment with the new features in V8 v6.9. Alternatively you can subscribe to Chrome’s Beta channel and try the new features out yourself soon.

Posted by the V8 team

Thursday, June 21, 2018

V8 release v6.8

Every six weeks, we create a new branch of V8 as part of our release process. Each version is branched from V8’s Git master immediately before a Chrome Beta milestone. Today we’re pleased to announce our newest branch, V8 version 6.8, which is in beta until its release in coordination with Chrome 68 Stable in several weeks. V8 v6.8 is filled with all sorts of developer-facing goodies. This post provides a preview of some of the highlights in anticipation of the release.

Memory

JavaScript functions unnecessarily kept outer functions and their metadata (known as SharedFunctionInfo or SFI) alive. Especially in function-heavy code that relies on short-lived IIFEs, this could lead to spurious memory leaks. Before this change, an active Context (i.e. an on-heap representation of a function activation) kept alive the SFI of the function that created the context.

By letting the Context point to a ScopeInfo object which contains the stripped-down information necessary for debugging, we can break the dependency on the SFI.
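The affected pattern looks roughly like this (an illustrative sketch; the names are made up). The IIFE runs once and only the small closure survives, yet before this change the surviving Context also pinned the IIFE’s SharedFunctionInfo:

// Only `getCounter` (and its Context holding `count`) needs to stay alive.
// Previously, that Context also kept the IIFE’s SharedFunctionInfo alive;
// now it only references a stripped-down ScopeInfo.
const getCounter = (function () {
  let count = 0;
  return function counter() {
    return ++count;
  };
})();

console.log(getCounter()); // → 1
console.log(getCounter()); // → 2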

We’ve already observed 3% V8 memory improvements on mobile devices over a set of top 10 pages.

In parallel we have reduced the memory consumption of SFIs themselves, removing unnecessary fields or compressing them where possible, and decreased their size by ~25%, with further reductions coming in future releases. We’ve observed SFIs taking up 2–6% of V8 memory on typical websites even after detaching them from the context, so you should see memory improvements on code with a large number of functions.

Performance

Array destructuring improvements

The optimizing compiler did not generate ideal code for array destructuring. For example, swapping variables using [a, b] = [b, a] used to be twice as slow as const tmp = a; a = b; b = tmp. Once we unblocked escape analysis to eliminate all temporary allocation, array destructuring with a temporary array became as fast as a sequence of assignments.
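In other words, the two swap idioms below should now perform comparably (a trivial sketch):

let a = 1, b = 2;

// Swap via array destructuring; this previously allocated a temporary array.
[a, b] = [b, a];

// Swap via an explicit temporary, the previously faster variant.
const tmp = a; a = b; b = tmp;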

Object.assign improvements

So far Object.assign had a fast path written in C++. That meant that the JavaScript-to-C++ boundary had to be crossed for each Object.assign call. An obvious way to improve the builtin performance was to implement a fast path on the JavaScript side. We had two options: either implement it as a native JS builtin (which would come with some unnecessary overhead in this case), or implement it using CodeStubAssembler technology (which provides more flexibility). We went with the latter solution. The new implementation of Object.assign improves the score of Speedometer2/React-Redux by about 15%, improving the total Speedometer 2 score by 1.5%.

TypedArray.prototype.sort improvements

TypedArray.prototype.sort has two paths: a fast path, used when the user does not provide a comparison function, and a slow path for everything else. Until now, the slow path reused the implementation for Array.prototype.sort, which does a lot more than is necessary for sorting TypedArrays. V8 v6.8 replaces the slow path with an implementation in CodeStubAssembler. (Not directly CodeStubAssembler but a domain-specific language that is built on top of CodeStubAssembler).

Performance for sorting TypedArrays without a comparison function stays the same while there is a speedup of up to 2.5× when sorting using a comparison function.
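For example (an illustrative sketch), the path with a custom comparison function is the one that got faster:

const values = new Float64Array([3.5, -1.25, 2, 10]);

// Slow path: a custom comparison function (now up to 2.5× faster).
values.sort((a, b) => b - a); // sort in descending order
console.log(values); // → Float64Array [ 10, 3.5, 2, -1.25 ]

// Fast path: no comparison function (performance unchanged).
values.sort();
console.log(values); // → Float64Array [ -1.25, 2, 3.5, 10 ]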

WebAssembly

In V8 v6.8 you can start using trap-based bounds checking on Linux x64 platforms. This memory management optimization considerably improves WebAssembly’s execution speed. It’s already used in Chrome 68, and in the future more platforms will be supported incrementally.

V8 API

Please use git log branch-heads/6.7..branch-heads/6.8 include/v8.h to get a list of the API changes.

Developers with an active V8 checkout can use git checkout -b 6.8 -t branch-heads/6.8 to experiment with the new features in V8 v6.8. Alternatively you can subscribe to Chrome’s Beta channel and try the new features out yourself soon.

Posted by the V8 team

Monday, June 11, 2018

Concurrent marking in V8

This post describes the garbage collection technique called concurrent marking. The optimization allows a JavaScript application to continue execution while the garbage collector scans the heap to find and mark live objects. Our benchmarks show that concurrent marking reduces the time spent marking on the main thread by 60%–70%. Concurrent marking is the last puzzle piece of the Orinoco project — the project to incrementally replace the old garbage collector with the new mostly concurrent and parallel garbage collector. Concurrent marking is enabled by default in Chrome 64 and Node.js v10.

Background

Marking is a phase of V8’s Mark-Compact garbage collector. During this phase the collector discovers and marks all live objects. Marking starts from the set of known live objects such as the global object and the currently active functions — the so-called roots. The collector marks the roots as live and follows the pointers in them to discover more live objects. The collector continues marking the newly discovered objects and following pointers until there are no more objects to mark. At the end of marking, all unmarked objects on the heap are unreachable from the application and can be safely reclaimed.

We can think of marking as a graph traversal. The objects on the heap are nodes of the graph. Pointers from one object to another are edges of the graph. Given a node in the graph, we can find all outgoing edges of that node using the hidden class of the object.

Figure 1. Object graph

V8 implements marking using two mark-bits per object and a marking worklist. Two mark-bits encode three colors: white (00), grey (10), and black (11). Initially all objects are white, which means that the collector has not discovered them yet. A white object becomes grey when the collector discovers it and pushes it onto the marking worklist. A grey object becomes black when the collector pops it from the marking worklist and visits all its fields. This scheme is called tri-color marking. Marking finishes when there are no more grey objects. All the remaining white objects are unreachable and can be safely reclaimed.

Figure 2. Marking starts from the roots.
Figure 3. The collector turns a grey object into black by processing its pointers.
Figure 4. The final state after marking is finished.
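The following sketch illustrates the tri-color scheme on a toy object graph. It is plain JavaScript for readability, not V8 code, and it stores the color as a property instead of packing it into two mark-bits:

const WHITE = 0, GREY = 1, BLACK = 2;

function mark(roots) {
  const worklist = [];
  for (const root of roots) {      // roots are discovered first
    root.color = GREY;
    worklist.push(root);
  }
  while (worklist.length > 0) {
    const object = worklist.pop(); // grey → black
    object.color = BLACK;
    for (const field of object.fields) {
      if (field.color === WHITE) { // white → grey
        field.color = GREY;
        worklist.push(field);
      }
    }
  }
}

// Everything still WHITE after marking is unreachable and can be reclaimed.
const c = { color: WHITE, fields: [] };
const b = { color: WHITE, fields: [c] };
const a = { color: WHITE, fields: [b] };
const garbage = { color: WHITE, fields: [] };
mark([a]);
console.log(a.color, b.color, c.color, garbage.color); // → 2 2 2 0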

Note that the marking algorithm described above works only if the application is paused while marking is in progress. If we allow the application to run during marking, then the application can change the graph and eventually trick the collector into freeing live objects.

Reducing marking pause

Marking performed all at once can take several hundred milliseconds for large heaps.

Such long pauses can make applications unresponsive and result in poor user experience. In 2011, V8 switched from stop-the-world marking to incremental marking. During incremental marking, the garbage collector splits up the marking work into smaller chunks and allows the application to run between the chunks:

The garbage collector chooses how much incremental marking work to perform in each chunk to match the rate of allocations by the application. In common cases this greatly improves the responsiveness of the application. For large heaps under memory pressure there can still be long pauses as the collector tries to keep up with the allocations.

Incremental marking does not come for free. The application has to notify the garbage collector about all operations that change the object graph. V8 implements the notification using a Dijkstra-style write-barrier. After each write operation of the form object.field = value in JavaScript, V8 inserts the write-barrier code:

// Called after `object.field = value`.
write_barrier(object, field_offset, value) {
  if (color(object) == black && color(value) == white) {
    set_color(value, grey);
    marking_worklist.push(value);
  }
}

The write-barrier enforces the invariant that no black object points to a white object. This is also known as the strong tri-color invariant and guarantees that the application cannot hide a live object from the garbage collector, so all white objects at the end of marking are truly unreachable for the application and can be safely freed.

Incremental marking integrates nicely with idle time garbage collection scheduling as described in an earlier blog post. Chrome’s Blink task scheduler can schedule small incremental marking steps during idle time on the main thread without causing jank. This optimization works really well if idle time is available.

Because of the write-barrier cost, incremental marking may reduce throughput of the application. It is possible to improve both throughput and pause times by making use of additional worker threads. There are two ways to do marking on worker threads: parallel marking and concurrent marking.

Parallel marking happens on the main thread and the worker threads. The application is paused throughout the parallel marking phase. It is the multi-threaded version of the stop-the-world marking.

Concurrent marking happens mostly on the worker threads. The application can continue running while concurrent marking is in progress.

The following two sections describe how we added support for parallel and concurrent marking in V8.

Parallel marking

During parallel marking we can assume that the application is not running concurrently. This substantially simplifies the implementation because we can assume that the object graph is static and does not change. In order to mark the object graph in parallel, we need to make the garbage collector data structures thread-safe and find a way to efficiently share marking work between threads. The following diagram shows the data structures involved in parallel marking. The arrows indicate the direction of data flow. For simplicity, the diagram omits the data structures that are needed for heap defragmentation.

Figure 5. Data structures for parallel marking

Note that the threads only read from the object graph and never change it. The mark-bits of the objects and the marking worklist have to support read and write accesses.

Marking worklist and work stealing

The implementation of the marking worklist is critical for performance: it has to balance fast thread-local operation against the ability to distribute work to other threads when they run out of work to do.

The extreme ends of that trade-off space are (a) using a completely concurrent data structure for best sharing, since all objects can potentially be shared, and (b) using a completely thread-local data structure where no objects can be shared, optimizing for thread-local throughput. Figure 6 shows how V8 balances these needs by using a marking worklist that is based on segments for thread-local insertion and removal. Once a segment becomes full, it is published to a shared global pool where it is available for stealing. This way V8 allows marking threads to operate locally without any synchronization for as long as possible, while still handling the case where a single thread reaches a new sub-graph of objects while another thread starves after completely draining its local segments.

Figure 6. Marking worklist
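The following is a heavily simplified sketch of that idea (in JavaScript for readability; the real implementation is synchronized C++, and the segment size here is made up):

const SEGMENT_CAPACITY = 64;   // illustrative only, not V8’s actual value
const globalPool = [];         // full segments published for stealing

class LocalMarkingWorklist {
  constructor() { this.segment = []; }
  push(object) {
    if (this.segment.length === SEGMENT_CAPACITY) {
      globalPool.push(this.segment); // publish the full segment
      this.segment = [];
    }
    this.segment.push(object);
  }
  pop() {
    if (this.segment.length === 0) {
      if (globalPool.length === 0) return null; // starved: no work anywhere
      this.segment = globalPool.pop();          // steal a published segment
    }
    return this.segment.pop();
  }
}

const local = new LocalMarkingWorklist();
local.push({ name: 'someHeapObject' });
console.log(local.pop()); // → { name: 'someHeapObject' }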

Concurrent marking

Concurrent marking allows JavaScript to run on the main thread while worker threads are visiting objects on the heap. This opens the door for many potential data races. For example, JavaScript may be writing to an object field at the same time as a worker thread is reading the field. The data races may cause the garbage collector to free a live object or to mix up primitive values with pointers.

Each operation on the main thread that changes the object graph is a potential source of a data race. Since V8 is a high-performance engine with many object layout optimizations, the list of potential data race sources is rather long. Here is a high-level breakdown:

  • Object allocation.
  • Write to an object field.
  • Object layout changes.
  • Deserialization from the snapshot.
  • Materialization during deoptimization of a function.
  • Evacuation during young generation garbage collection.
  • Code patching.

The main thread needs to synchronize with the worker threads on these operations. The cost and complexity of synchronization depends on the operation. Most operations allow lightweight synchronization with atomic memory accesses, but a few operations require exclusive access to the object. In the following subsections we highlight some of the interesting cases.

Write barrier

The data race caused by a write to an object field is resolved by turning the write operation into a relaxed atomic write and tweaking the write barrier:

// Called after atomic_relaxed_write(&object.field, value);
write_barrier(object, field_offset, value) {
  if (color(value) == white && atomic_color_transition(value, white, grey)) {
    marking_worklist.push(value);
  }
}

Compare it with the previously used write barrier:

// Called after `object.field = value`.
write_barrier(object, field_offset, value) {
  if (color(object) == black && color(value) == white) {
    set_color(value, grey);
    marking_worklist.push(value);
  }
}

There are two changes:

  1. The color check of the source object (color(object) == black) is gone.
  2. The color transition of the value from white to grey happens atomically.

Without the source object color check the write barrier becomes more conservative, i.e. it may mark objects as live even if those objects are not really reachable. We removed the check to avoid an expensive memory fence that would be needed between the write operation and the write barrier:

atomic_relaxed_write(&object.field, value);
memory_fence();
write_barrier(object, field_offset, value);

Without the memory fence, the object color load operation can be reordered before the write operation. If we don’t prevent the reordering, then the write barrier may observe a grey object color and bail out, while a worker thread marks the object without seeing the new value. The original write barrier proposed by Dijkstra et al. also does not check the object color; they omitted the check for simplicity, but we omit it for correctness.

Bailout worklist

Some operations, for example code patching, require exclusive access to the object. Early on we decided to avoid per-object locks because they can lead to the priority inversion problem, where the main thread has to wait for a worker thread that is descheduled while holding an object lock. Instead of locking an object, we allow the worker thread to bail out of visiting the object. The worker thread does that by pushing the object onto the bailout worklist, which is processed only by the main thread:

Figure 7. The bailout worklist

Worker threads bail out on optimized code objects, hidden classes, and weak collections because visiting them would require locking or an expensive synchronization protocol.

In retrospect, the bailout worklist turned out to be great for incremental development. We started the implementation with worker threads bailing out on all object types and then added concurrency one object type at a time.

Object layout changes

A field of an object can store three kinds of values: a tagged pointer, a tagged small integer (also known as a Smi), or an untagged value like an unboxed floating-point number. Pointer tagging is a well-known technique that allows efficient representation of unboxed integers. In V8 the least significant bit of a tagged value indicates whether it is a pointer or an integer. This relies on the fact that pointers are word-aligned. The information about whether a field is tagged or untagged is stored in the hidden class of the object.
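The following sketch shows the general tagging idea with a 1-bit tag (simplified for illustration; the actual shifts differ between V8’s 32-bit and 64-bit builds):

const kSmiTag = 0;        // …xxx0: a small integer, stored shifted left
const kHeapObjectTag = 1; // …xxx1: a pointer (word-aligned, so bit 0 is free)

const tagSmi = (n) => n << 1;
const isSmi = (tagged) => (tagged & 1) === kSmiTag;
const untagSmi = (tagged) => tagged >> 1;

const tagPointer = (address) => address | kHeapObjectTag;
const untagPointer = (tagged) => tagged & ~kHeapObjectTag;

console.log(isSmi(tagSmi(42)), untagSmi(tagSmi(42)));         // → true 42
console.log(isSmi(tagPointer(0x1000)));                       // → false
console.log(untagPointer(tagPointer(0x1000)).toString(16));   // → '1000'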

Some operations in V8 change an object field from tagged to untagged (or vice versa) by transitioning the object to another hidden class. Such an object layout change is unsafe for concurrent marking. If the change happens while a worker thread is visiting the object concurrently using the old hidden class, then two kinds of bugs are possible. First, the worker may miss a pointer thinking that it is an untagged value. The write barrier protects against this kind of bug. Second, the worker may treat an untagged value as a pointer and dereference it, which would result in an invalid memory access typically followed by a program crash. In order to handle this case we use a snapshotting protocol that synchronizes on the mark-bit of the object. The protocol involves two parties: the main thread changing an object field from tagged to untagged and the worker thread visiting the object. Before changing the field, the main thread ensures that the object is marked as black and pushes it into the bailout worklist for visiting later on:

atomic_color_transition(object, white, grey);
if (atomic_color_transition(object, grey, black)) {
  // The object will be revisited on the main thread during draining
  // of the bailout worklist.
  bailout_worklist.push(object);
}
unsafe_object_layout_change(object);

As shown in the code snippet below, the worker thread first loads the hidden class of the object and snapshots all the pointer fields of the object specified by the hidden class using atomic relaxed load operations. Then it tries to mark the object black using an atomic compare and swap operation. If marking succeeded then this means that the snapshot must be consistent with the hidden class because the main thread marks the object black before changing its layout.

snapshot = [];
hidden_class = atomic_relaxed_load(&object.hidden_class);
for (field_offset in pointer_field_offsets(hidden_class)) {
  pointer = atomic_relaxed_load(object + field_offset);
  snapshot.add(field_offset, pointer);
}
if (atomic_color_transition(object, grey, black)) {
  visit_pointers(snapshot);
}

Note that a white object that undergoes an unsafe layout change has to be marked on the main thread. Unsafe layout changes are relatively rare, so this does not have a big impact on performance of real world applications.

Putting it all together

We integrated concurrent marking into the existing incremental marking infrastructure. The main thread initiates marking by scanning the roots and filling the marking worklist. After that it posts concurrent marking tasks on the worker threads. The worker threads help the main thread to make faster marking progress by cooperatively draining the marking worklist. Once in a while the main thread participates in marking by processing the bailout worklist and the marking worklist. Once the marking worklists become empty, the main thread finalizes garbage collection. During finalization the main thread re-scans the roots and may discover more white objects. Those objects are marked in parallel with the help of worker threads.

Results

Our real-world benchmarking framework shows about 65% and 70% reduction in main thread marking time per garbage collection cycle on mobile and desktop respectively.

Concurrent marking also reduces garbage collection jank in Node.js. This is particularly important since Node.js never implemented idle time garbage collection scheduling and therefore was never able to hide marking time in non-jank-critical phases. Concurrent marking shipped in Node.js v10.

Posted by Ulan Degenbaev, Michael Lippautz, and Hannes Payer — main thread liberators

Friday, May 4, 2018

V8 release v6.7

Every six weeks, we create a new branch of V8 as part of our release process. Each version is branched from V8’s Git master immediately before a Chrome Beta milestone. Today we’re pleased to announce our newest branch, V8 version 6.7, which is in beta until its release in coordination with Chrome 67 Stable in several weeks. V8 v6.7 is filled with all sorts of developer-facing goodies. This post provides a preview of some of the highlights in anticipation of the release.

JavaScript language features

V8 v6.7 ships with BigInt support enabled by default. BigInts are a new numeric primitive in JavaScript that can represent integers with arbitrary precision. Read the Web Fundamentals article on BigInt for more info on how BigInts can be used in JavaScript, and check out our write-up with more details about the V8 implementation.

Untrusted code mitigations

In V8 v6.7 we’ve landed more mitigations for side-channel vulnerabilities to prevent information leaks to untrusted JavaScript and WebAssembly code.

V8 API

Please use git log branch-heads/6.6..branch-heads/6.7 include/v8.h to get a list of the API changes.

Developers with an active V8 checkout can use git checkout -b 6.7 -t branch-heads/6.7 to experiment with the new features in V8 v6.7. Alternatively you can subscribe to Chrome’s Beta channel and try the new features out yourself soon.

Posted by the V8 team

Wednesday, May 2, 2018

Adding BigInts to V8

Over the past couple of months, we have implemented support for BigInts in V8, as currently specified by this proposal, to be included in a future version of ECMAScript. The following post tells the story of our adventures.

TL;DR

As a JavaScript programmer, you now1 have integers with arbitrary2 precision in your toolbox:

const a = 2172141653n;
const b = 15346349309n;
a * b;
// → 33334444555566667777n     // Yay!
Number(a) * Number(b);
// → 33334444555566670000      // Boo!
const such_many = 2n ** 222n;
// → 6739986666787659948666753771754907668409286105635143120275902562304n

For details about the new functionality and how it could be used, refer to our in-depth Web Fundamentals article on BigInt. We are looking forward to seeing the awesome things you’ll build with them!

1 Now if you run Chrome Beta, Dev, or Canary, or a preview Node.js version, otherwise soon (Chrome 67, Node.js master probably around the same time).

2 Arbitrary up to an implementation-defined limit. Sorry, we haven’t yet figured out how to squeeze an infinite amount of data into your computer’s finite amount of memory.

Representing BigInts in memory

Typically, computers store integers in their CPU’s registers (which nowadays are usually 32 or 64 bits wide), or in register-sized chunks of memory. This leads to the minimum and maximum values you might be familiar with. For example, a 32-bit signed integer can hold values from -2,147,483,648 to 2,147,483,647. The idea of BigInts, however, is to not be restricted by such limits.

So how can one store a BigInt with a hundred, or a thousand, or a million bits? It can’t fit in a register, so we allocate an object in memory. We make it large enough to hold all the BigInt’s bits, in a series of chunks, which we call “digits” — because this is conceptually very similar to how one can write bigger numbers than “9” by using more digits, like in “10”; except where the decimal system uses digits from 0 to 9, our BigInts use digits from 0 to 4294967295 (i.e. 2**32-1). That’s the value range of a 32-bit CPU register3, without a sign bit; we store the sign bit separately. In pseudo-code, a BigInt object with 3*32 = 96 bits looks like this:

{
  type: 'BigInt',
  sign: 0,
  num_digits: 3,
  digits: [0x12…, 0x34…, 0x56…],
}

3 On 64-bit machines, we use 64-bit digits, i.e. from 0 to 18446744073709551615 (i.e. 2n**64n-1n).

Back to school, and back to Knuth

Working with integers kept in CPU registers is really easy: to e.g. multiply two of them, there’s a machine instruction which software can use to tell the CPU “multiply the contents of these two registers!”, and the CPU will do it. For BigInt arithmetic, we have to come up with our own solution. Thankfully this particular task is something that quite literally every child at some point learns how to solve: remember what you did back in school when you had to multiply 345 * 678 and weren’t allowed to use a calculator?

345 * 678
---------
     30    //   5 * 6
+   24     //  4  * 6
+  18      // 3   * 6
+     35   //   5 *  7
+    28    //  4  *  7
+   21     // 3   *  7
+      40  //   5 *   8
+     32   //  4  *   8
+    24    // 3   *   8
=========
   233910

That’s exactly how V8 multiplies BigInts: one digit at a time, adding up the intermediate results. The algorithm works just as well for digits from 0 to 9 as it does for a BigInt’s much bigger digits.
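Here is a sketch of the same schoolbook algorithm over arrays of digits. It uses base 10 for readability, whereas V8’s digits use base 2**32 (or 2**64); this is an illustration, not V8’s actual implementation:

const BASE = 10; // V8 uses 2**32 or 2**64 per digit

// Multiply two magnitudes stored as digit arrays, least-significant digit first.
function multiplyDigits(a, b) {
  const result = new Array(a.length + b.length).fill(0);
  for (let i = 0; i < a.length; i++) {
    let carry = 0;
    for (let j = 0; j < b.length; j++) {
      const t = result[i + j] + a[i] * b[j] + carry;
      result[i + j] = t % BASE;
      carry = Math.floor(t / BASE);
    }
    result[i + b.length] += carry;
  }
  return result;
}

console.log(multiplyDigits([5, 4, 3], [8, 7, 6]).reverse().join('')); // → '233910'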

Donald Knuth published a specific implementation of multiplication and division of large numbers made up of smaller chunks in Volume 2 of his classic The Art of Computer Programming, all the way back in 1969. V8’s implementation follows this book, which shows that this is a pretty timeless piece of computer science.

“Less desugaring” == more sweets?

Perhaps surprisingly, we had to spend quite a bit of effort on getting seemingly simple unary operations, like -x, to work. So far, -x did exactly the same as x * (-1), so to simplify things, V8 applied precisely this replacement as early as possible when processing JavaScript, namely in the parser. This approach is called “desugaring”, because it treats an expression like -x as “syntactic sugar” for x * (-1). Other components (the interpreter, the compiler, the entire runtime system) didn’t even need to know what a unary operation is, because they only ever saw the multiplication, which of course they must support anyway.

With BigInts, however, this implementation suddenly becomes invalid, because multiplying a BigInt with a Number (like -1) must throw a TypeError4. The parser would have to desugar -x to x * (-1n) if x is a BigInt — but the parser has no way of knowing what x will evaluate to. So we had to stop relying on this early desugaring, and instead add proper support for unary operations on both Numbers and BigInts everywhere.

4 Mixing BigInt and Number operand types is generally not allowed. That’s somewhat unusual for JavaScript, but there is an explanation for this decision.
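A quick sketch shows concretely why the old desugaring is no longer valid:

const x = 9007199254740993n;

console.log(-x);      // → -9007199254740993n (unary minus works on a BigInt)
console.log(x * -1n); // → -9007199254740993n (fine: both operands are BigInts)

try {
  x * -1;              // mixing a BigInt with the Number -1
} catch (e) {
  console.log(e.name); // → 'TypeError', so desugaring -x to x * (-1) would break
}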

A bit of fun with bitwise ops

Most computer systems in use today store signed integers using a neat trick called “two’s complement”, which has the nice properties that the first bit indicates the sign, and adding 1 to the bit pattern always increments the number by 1, taking care of the sign bit automatically. For example, for 8-bit integers:

  • 10000000 is -128, the lowest representable number,
  • 10000001 is -127,
  • 11111111 is -1,
  • 00000000 is 0,
  • 00000001 is 1,
  • 01111111 is 127, the highest representable number.

This encoding is so common that many programmers expect it and rely on it, and the BigInt specification reflects this fact by prescribing that BigInts must act as if they used two’s complement representation. As described above, V8’s BigInts don’t!

To perform bitwise operations according to spec, our BigInts therefore must pretend to be using two’s complement under the hood. For positive values, it doesn’t make a difference, but negative numbers must do extra work to accomplish this. That has the somewhat surprising effect that a & b, if a and b are both negative BigInts, actually performs four steps (as opposed to just one if they were both positive): both inputs are converted to fake-two’s-complement format, then the actual operation is done, then the result is converted back to our real representation. Why the back-and-forth, you might ask? Because all the non-bitwise operations are much easier that way.
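For example, the observable results match what two’s complement machine integers would give (a few quick checks):

// -1n behaves as if it were an infinite string of 1-bits:
console.log(-1n & 0xffn); // → 255n
console.log(-2n & 0xffn); // → 254n

// BigInt.asIntN / BigInt.asUintN reinterpret the low bits accordingly:
console.log(BigInt.asIntN(8, 255n)); // → -1n
console.log(BigInt.asUintN(8, -1n)); // → 255n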

Two new types of TypedArrays

The BigInt proposal includes two new TypedArray flavors: BigInt64Array and BigUint64Array. We can have TypedArrays with 64-bit wide integer elements now that BigInts provide a natural way to read and write all the bits in those elements, whereas if one tried to use Numbers for that, some bits might get lost. That’s why the new arrays aren’t quite like the existing 8/16/32-bit integer TypedArrays: accessing their elements is always done with BigInts; trying to use Numbers throws an exception.

> const big_array = new BigInt64Array(1);
> big_array[0] = 123n;  // OK
> big_array[0]
123n
> big_array[0] = 456;
TypeError: Cannot convert 456 to a BigInt
> big_array[0] = BigInt(456);  // OK

Just as JavaScript code working with these kinds of arrays looks and behaves a bit differently from traditional TypedArray code, we had to generalize our TypedArray implementation to behave differently for the two newcomers.

Optimization considerations

For now, we are shipping a baseline implementation of BigInts. It is functionally complete and should provide solid performance (a little bit faster than existing userland libraries), but it is not particularly optimized. The reason is that, in line with our aim to prioritize real-world applications over artificial benchmarks, we first want to see how you will use BigInts, so that we can then optimize precisely the cases you care about!

For example, if we see that relatively small BigInts (up to 64 bits) are an important use case, we could make those more memory-efficient by using a special representation for them:

{
  type: 'BigInt-Int64',
  value: 0x12…,
}

One of the details that remain to be seen is whether we should do this for “int64” value ranges, “uint64” ranges, or both, keeping in mind that supporting fewer fast paths means we can ship them sooner, and also that every additional fast path ironically makes everything else a bit slower, because affected operations always have to check whether it is applicable.

Another story is support for BigInts in the optimizing compiler. For computationally heavy applications operating on 64-bit values and running on 64-bit hardware, keeping those values in registers would be much more efficient than allocating them as objects on the heap as we currently do. We have plans for how we would implement such support, but it is another case where we would first like to find out whether that is really what you, our users, care about the most; or whether we should spend our time on something else instead.

Please send us feedback on what you’re using BigInts for, and any issues you encounter! You can reach us at our bug tracker crbug.com/v8/new, via mail to v8-users@googlegroups.com, or @v8js on Twitter.

Posted by Jakob Kummerow, arbitrator of precision

Tuesday, April 24, 2018

Improved code caching

V8 uses code caching to cache the generated code for frequently-used scripts. Starting with Chrome 66, we are caching more code by generating the cache after top-level execution. This leads to a 20-40% reduction in parse and compilation time during the initial load.

Background

V8 uses two kinds of code caching to cache generated code for later reuse. The first is the in-memory cache that is available within each instance of V8. The code generated after the initial compile is stored in this cache, keyed on the source string, and is available for reuse within the same instance of V8. The other kind of code caching serializes the generated code and stores it on disk for future use. This cache is not specific to a particular instance of V8 and can be shared across different instances of V8. This blog post focuses on the second kind of code caching as used in Chrome; other embedders also use it, but here we only cover Chrome’s usage.

Chrome stores the serialized generated code in the disk cache and keys it with the URL of the script resource. When loading a script, Chrome checks the disk cache. If the script is already cached, Chrome passes the serialized data to V8 as part of the compile request. V8 then deserializes this data instead of parsing and compiling the script. There are also additional checks involved to ensure that the code is still usable (for example, a version mismatch makes the cached data unusable).

Real-world data shows that the code cache hit rate (for scripts that could be cached) is high (~86%). Though the cache hit rate is high for these scripts, the amount of code we cache per script is not very high. Our analysis showed that increasing the amount of code that is cached would reduce the time spent parsing and compiling JavaScript code by around 40%.

Increasing the amount of code that is cached

In the previous approach, code caching was coupled with the requests to compile the script.

Embedders could request that V8 serialize the code it generated during the top-level compilation of a new JavaScript source file, and V8 returned the serialized code after compiling the script. When Chrome requested the same script again, V8 fetched the serialized code from the cache and deserialized it, completely avoiding recompilation of functions that were already in the cache. These scenarios are shown in the following figure:

V8 only compiles the functions that are expected to be executed immediately (IIFEs) during the top-level compile and marks other functions for lazy compilation. This helps improve page load times by avoiding compilation of functions that are not required; however, it means that the serialized data only contains the code for the functions that are eagerly compiled.

Prior to Chrome 59, we had to generate the code cache before any execution had started. V8’s earlier baseline compiler (Full-codegen) generated code specialized for the execution context, and used code patching to fast-path operations for that specific context. Such code cannot easily be serialized for use in other execution contexts by stripping out the context-specific data.

With the launch of Ignition in Chrome 59, this restriction is no longer necessary. Ignition uses data-driven inline caches to fast-path operations in the current execution context. The context-dependent data is stored in feedback vectors and is separate from the generated code. This has opened the possibility of generating code caches even after the execution of the script. As we execute the script, more functions (that were marked for lazy compile) are compiled, allowing us to cache more code.

V8 exposes a new API, ScriptCompiler::CreateCodeCache, to request code caches independently of the compile requests. Requesting code caches along with compile requests is deprecated and no longer works from V8 v6.6 onwards. Since version 66, Chrome uses this API to request the code cache after the top-level execution. The following figure shows the new scenario of requesting the code cache. The code cache is requested after the top-level execution and hence contains the code for functions that were compiled later, during the execution of the script. In later runs (shown as hot runs in the following figure), this avoids compilation of functions during top-level execution.

Results

The performance of this feature is measured using our internal real-world benchmarks. The following graph shows the reduction in the parse and compile time over the earlier caching scheme. There is a reduction of around 20–40% in both parse and compilation time on most of the pages.

Data from the wild shows similar results with a 20–40% reduction in the time spent in compiling JavaScript code both on desktop and mobile. On Android, this optimization also translates to a 1–2% reduction in the top-level page-load metrics like the time a webpage takes to become interactive. We also monitored the memory and disk usage of Chrome and did not see any noticeable regressions.

Posted by Mythri Alle, Chief Code Cacher