75x faster: optimizing the Ion compiler backend

In September, machine learning engineers at Mozilla filed a bug report indicating that Firefox was consuming excessive memory and CPU resources while running Microsoft’s ONNX Runtime (a machine learning library) compiled to WebAssembly.

This post describes how we addressed this and some of our longer-term plans for improving WebAssembly performance in the future.

The problem

SpiderMonkey has two compilers for WebAssembly code. First, a Wasm module is compiled with the Wasm Baseline compiler, a compiler that produces decent machine code very quickly. This is great for startup time because we can start executing Wasm code almost immediately after downloading it. Andy Wingo wrote a nice blog post about this Baseline compiler.

When Baseline compilation is finished, we compile the Wasm module with our more advanced Ion compiler. This backend produces faster machine code, but compilation time is a lot higher.

The issue with the ONNX module was that the Ion compiler backend took a long time and used a lot of memory to compile it. On my Linux x64 machine, Ion-compiling this module took about 5 minutes and used more than 4 GB of memory. Even though this work happens on background threads, this was still too much overhead.

Optimizing the Ion backend

When we investigated this, we noticed that this Wasm module had some extremely large functions. For the largest one, Ion’s MIR control flow graph contained 132856 basic blocks. This uncovered some performance cliffs in our compiler backend.

VirtualRegister live ranges

In Ion’s register allocator, each VirtualRegister has a list of LiveRange objects. We were using a linked list for this, sorted by start position. This caused quadratic behavior when allocating registers: the allocator frequently splits live ranges into smaller ranges, and we’d have to iterate over the list for each new range to insert it at the correct position to keep the list sorted. This was very slow for virtual registers with thousands of live ranges.

To address this, I tried a few different data structures. The first attempt was to use an AVL tree instead of a linked list, and that was a big improvement, but the performance was still not optimal and we were also worried about memory usage increasing even more.

After this we realized we could store live ranges in a vector (instead of a linked list) that’s optionally sorted by decreasing start position. We also made some changes to ensure the initial live ranges are sorted when we create them, so that we could just append ranges to the end of the vector.

The observation here was that the core of the register allocator, where it assigns registers or stack slots to live ranges, doesn’t actually require the live ranges to be sorted. We therefore now just append new ranges to the end of the vector and mark the vector unsorted. Right before the final phase of the allocator, where we again rely on the live ranges being sorted, we do a single std::sort operation on the vector for each virtual register with unsorted live ranges. Debug assertions are used to ensure that functions that require the vector to be sorted are not called when it’s marked unsorted.

Vectors are also better for cache locality and they let us use binary search in a few places. When I was discussing this with Julian Seward, he pointed out that Chris Fallin also moved away from linked lists to vectors in Cranelift’s port of Ion’s register allocator. It’s always good to see convergent evolution 🙂

This change from sorted linked lists to optionally-sorted vectors made Ion compilation of this Wasm module about 20 times faster, down to 14 seconds.
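The optionally-sorted-vector idea can be sketched as follows. This is a minimal illustration with made-up names, not SpiderMonkey's actual VirtualRegister code: appends are O(1) and just mark the vector dirty, and a single sort happens before the phase that needs ordering.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of an optionally-sorted live-range vector.
struct LiveRange {
    uint32_t start;
    uint32_t end;
};

struct VirtualRegisterRanges {
    std::vector<LiveRange> ranges;
    bool sorted = true;

    // O(1) append; conservatively mark the vector as unsorted.
    void addRange(LiveRange r) {
        ranges.push_back(r);
        sorted = false;
    }

    // Called once before the final allocator phase that needs order.
    void ensureSorted() {
        if (sorted) {
            return;
        }
        // Sorted by decreasing start position, as described above.
        std::sort(ranges.begin(), ranges.end(),
                  [](const LiveRange& a, const LiveRange& b) {
                      return a.start > b.start;
                  });
        sorted = true;
    }

    // Functions that rely on order assert sortedness, mirroring the
    // debug assertions mentioned in the post.
    const LiveRange& firstRange() const {
        assert(sorted);
        return ranges.back();  // smallest start position is at the back
    }
};
```

The key trade-off: many cheap appends plus one O(n log n) sort, instead of an O(n) ordered insertion per split.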

Semi-NCA

The next problem that stood out in performance profiles was the Dominator Tree Building compiler pass, in particular a function called ComputeImmediateDominators. This function determines the immediate dominator block for each basic block in the MIR graph.

The algorithm we used for this (based on A Simple, Fast Dominance Algorithm by Cooper et al.) is relatively simple but didn’t scale well to very large graphs.
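For reference, the fixed-point algorithm from the Cooper et al. paper looks roughly like this. This is a simplified sketch using reverse-postorder block indices; the names and layout are ours, not SpiderMonkey's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the Cooper-Harvey-Kennedy iterative dominator algorithm.
// Blocks are indexed 0..n-1 in reverse postorder; block 0 is the entry.
static const int UNDEF = -1;

// Walk both "fingers" up the dominator tree until they meet.
static int intersect(const std::vector<int>& idom, int a, int b) {
    while (a != b) {
        while (a > b) a = idom[a];
        while (b > a) b = idom[b];
    }
    return a;
}

// preds[b] lists the predecessors of block b.
std::vector<int> computeIdoms(const std::vector<std::vector<int>>& preds) {
    size_t n = preds.size();
    std::vector<int> idom(n, UNDEF);
    idom[0] = 0;  // the entry block dominates itself
    bool changed = true;
    while (changed) {    // iterate to a fixed point; this repeated
        changed = false; // sweeping is what scaled poorly on huge graphs
        for (size_t b = 1; b < n; b++) {
            int newIdom = UNDEF;
            for (int p : preds[b]) {
                if (idom[p] == UNDEF) continue;  // not yet processed
                newIdom = (newIdom == UNDEF) ? p : intersect(idom, newIdom, p);
            }
            if (newIdom != UNDEF && idom[b] != newIdom) {
                idom[b] = newIdom;
                changed = true;
            }
        }
    }
    return idom;
}
```

On well-behaved graphs this converges in a couple of passes, but each pass walks dominator chains, which is where the cost blows up on graphs with 100k+ blocks.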

Semi-NCA (from Linear-Time Algorithms for Dominators and Related Problems by Loukas Georgiadis) is a different algorithm that’s also used by LLVM and the Julia compiler. I prototyped this and was surprised to see how much faster it was: it got our total compilation time down from 14 seconds to less than 8 seconds. For a single-threaded compilation, it reduced the time under ComputeImmediateDominators from 7.1 seconds to 0.15 seconds.

Fortunately it was simple to run both algorithms in debug builds and assert they computed the same immediate dominator for each basic block. After a week of fuzz-testing, no problems were found and we landed a patch that removed the old implementation and enabled the Semi-NCA code.

Sparse BitSets

For each basic block, the register allocator allocated a (dense) bit set with a bit for each virtual register. These bit sets are used to check which virtual registers are live at the start of a block.

For the largest function in the ONNX Wasm module, this used a lot of memory: 199477 virtual registers x 132856 basic blocks is at least 3.1 GB just for these bit sets! Because most virtual registers have short live ranges, these bit sets had relatively few bits set to 1.

We replaced these dense bit sets with a new SparseBitSet data structure that uses a hashmap to store 32 bits per entry. Because most of these hashmaps contain a small number of entries, it uses an InlineMap to optimize for this: a data structure that stores entries either in a small inline array or (when the array is full) in a hashmap. We also optimized InlineMap to use a variant (a union type) for these two representations to save memory.
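A minimal sketch of the idea, assuming a plain hashmap keyed by 32-bit word index (SpiderMonkey's real SparseBitSet additionally uses the InlineMap optimization described above; class and method names here are ours):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical sparse bit set: only 32-bit words that contain at least
// one set bit occupy memory, so mostly-empty sets stay tiny.
class SparseBitSet {
    std::unordered_map<uint32_t, uint32_t> words_;

  public:
    void insert(uint32_t bit) {
        words_[bit / 32] |= (1u << (bit % 32));
    }

    bool contains(uint32_t bit) const {
        auto it = words_.find(bit / 32);
        return it != words_.end() && (it->second & (1u << (bit % 32))) != 0;
    }
};
```

With a dense representation, a set over 199477 virtual registers costs ~25 KB per block regardless of how many bits are set; a sparse set costs memory proportional to the number of live registers.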

This saved at least 3 GB of memory and also improved the compilation time for the Wasm module to 5.4 seconds.

Faster move resolution

The last issue that showed up in profiles was a function in the register allocator called createMoveGroupsFromLiveRangeTransitions. After the register allocator assigns a register or stack slot to each live range, this function is responsible for connecting pairs of live ranges by inserting moves.

For example, if a value is stored in a register but is later spilled to memory, there will be two live ranges for its virtual register. This function then inserts a move instruction to copy the value from the register to the stack slot at the start of the second live range.

This function was slow because it had a number of loops with quadratic behavior: for a move’s destination range, it would do a linear lookup to find the best source range. We optimized the main two loops to run in linear time instead of being quadratic, by taking more advantage of the fact that live ranges are sorted.

With these changes, Ion can compile the ONNX Wasm module in less than 3.9 seconds on my machine, more than 75x faster than before these changes.

Adobe Photoshop

These changes not only improved performance for the ONNX Runtime module, but also for a number of other WebAssembly modules. A large Wasm module downloaded from the free online Adobe Photoshop demo can now be Ion-compiled in 14 seconds instead of 4 minutes.

The JetStream 2 benchmark has a HashSet module that was affected by the quadratic move resolution code. Ion compilation time for it improved from 2.8 seconds to 0.2 seconds.

New Wasm compilation pipeline

Even though these are great improvements, spending at least 14 seconds (on a fast machine!) to fully compile Adobe Photoshop on background threads still isn’t an amazing user experience. We expect this to only get worse as more large applications are compiled to WebAssembly.

To address this, our WebAssembly team is making great progress rearchitecting the Wasm compiler pipeline. This work will make it possible to Ion-compile individual Wasm functions as they warm up instead of compiling everything immediately. It will also unlock exciting new capabilities such as (speculative) inlining.

Stay tuned for updates on this as we start rolling out these changes in Firefox.

– Jan de Mooij, engineer on the SpiderMonkey team
