Looking Ahead at Intel’s Xe3 GPU Architecture


Intel’s foray into high performance graphics has seen impressive progress over the past few years, and the company is not letting up on the gas. Tom Petersen from Intel has indicated that Xe3 hardware design is complete, and software work is underway. Some of that software work is visible across several different open source repositories, offering a preview of what’s to come.

Modern GPUs are built from a hierarchy of subdivision levels, letting them scale to hit different performance, power, and price targets. A shader program running on an Intel GPU can check where it’s running by reading the low bits of the sr0 (state register 0) architectural register.

The sr0 topology bits on Xe3 have a different layout1. Xe Cores within a Render Slice are enumerated with four bits, up from two in prior generations. Thus Xe3’s topology bits could describe a Render Slice with up to 16 Xe Cores. Prior Xe generations could only have four Xe Cores per Render Slice, and frequently went right up to that limit. The B580 and A770 both placed four Xe Cores in each Render Slice.
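To make the layout change concrete, here is a minimal decode sketch. The field positions, ordering, and names are assumptions for illustration only; the real sr0 bit assignments live in Intel’s compiler and driver sources and differ in detail.

```cpp
// Illustrative only: decode hypothetical sr0 topology fields.
#include <cstdint>
#include <cstdio>

struct Topology {
    uint32_t xeCoreInSlice;  // which Xe Core within the Render Slice
    uint32_t renderSlice;    // which Render Slice
};

// Prior generations: 2 bits enumerate Xe Cores within a Render Slice.
Topology decodeXe2(uint32_t sr0) {
    return { sr0 & 0x3, (sr0 >> 2) & 0xF };   // 2-bit core ID, then slice ID (assumed order)
}

// Xe3: 4 bits enumerate Xe Cores, allowing up to 16 per Render Slice.
Topology decodeXe3(uint32_t sr0) {
    return { sr0 & 0xF, (sr0 >> 4) & 0xF };   // 4-bit core ID, then slice ID (assumed order)
}

int main() {
    Topology t = decodeXe3(0x2B);  // hypothetical raw value
    std::printf("Xe Core %u in Render Slice %u\n", t.xeCoreInSlice, t.renderSlice);
}
```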

Having enough bits to describe a certain configuration doesn’t mean Intel will ship something that big. Xe did use its maximum 32 core, 4096 lane setup in the Arc A770. However, Xe2 maxed out at 20 cores and 2560 lanes with the Arc B580. Xe2’s sr0 format could theoretically enumerate 16 slices. Giving each slice the maximum of four Xe Cores would produce a 64 Xe Core GPU with 8192 FP32 lanes. Obviously the B580 doesn’t get anywhere close to that.

Visualizing the shader array on a hypothetical enormous Xe2 implementation that maxes out all topology enumeration bits

Xe3 goes even further. Maxing out all the topology enumeration bits would result in a ludicrously large 256 Xe Core configuration with 32768 FP32 lanes. That’s even larger than Nvidia’s RTX 5090, which “only” has 21760 FP32 lanes. Intel has been focusing on the midrange segment for a while, and I doubt we’ll see anything that huge.
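The arithmetic behind those theoretical maximums is simple. The sketch below assumes 128 FP32 lanes per Xe Core, which follows from the A770 (32 cores, 4096 lanes) and B580 (20 cores, 2560 lanes) figures above.

```cpp
// Back-of-the-envelope math behind the quoted lane counts.
#include <cstdio>

int main() {
    const int lanesPerXeCore = 128;  // implied by A770: 4096 lanes / 32 cores

    // Xe2 sr0 format: up to 16 slices x 4 Xe Cores per slice.
    int xe2MaxCores = 16 * 4;
    std::printf("Xe2 theoretical max: %d cores, %d lanes\n",
                xe2MaxCores, xe2MaxCores * lanesPerXeCore);   // 64 cores, 8192 lanes

    // Xe3 sr0 format: up to 16 slices x 16 Xe Cores per slice.
    int xe3MaxCores = 16 * 16;
    std::printf("Xe3 theoretical max: %d cores, %d lanes\n",
                xe3MaxCores, xe3MaxCores * lanesPerXeCore);   // 256 cores, 32768 lanes
}
```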

Instead, I think Intel wants more flexibility to scale compute power independently of fixed function hardware like ROPs and rasterizers. AMD and Nvidia’s SAs and GPCs all pack a lot more than four cores. For example, the RX 6900 XT’s Shader Arrays each have 10 WGPs. Nvidia’s RTX 4090 puts eight SMs in each GPC. GPUs have become more compute-heavy over time, as games use more complex shader programs. Intel seems to be following the same trend.

Xe Vector Engines (XVEs) execute shader programs on Intel GPUs. They use a combination of vector-level and thread-level parallelism to hide latency.

Xe3 XVEs can run 10 threads concurrently, up from eight in prior generations. Like SMT on a CPU, tracking multiple threads helps an XVE hide latency using thread-level parallelism. If one thread stalls, the XVE can hopefully find a non-stalled thread to issue instructions from. Active thread count is also referred to as thread occupancy. 100% occupancy on a GPU would be analogous to 100% utilization in Windows Task Manager. Unlike CPU SMT implementations, GPU occupancy can be limited by register file capacity.

Prior Intel GPUs had two register allocation modes. Normally each thread gets 128 512-bit registers, for 8 KB of registers per thread. A “large GRF” mode gives each thread 256 registers, but drops occupancy to four threads because of register file capacity limits. Xe3 continues to use 64 KB register files per XVE, but flexibly allocates registers in 32-entry blocks2. That lets Xe3’s XVEs keep 10 threads in flight as long as each thread uses 96 or fewer registers. If a shader program needs a lot of registers, occupancy degrades more gracefully than in prior generations.
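As a rough model of that scheme, the sketch below computes per-XVE occupancy from a thread’s register demand, assuming a 64 KB register file of 512-bit (64 B) registers, a 10-thread cap, and allocation rounded up to 32-register blocks. It models the description above, not Intel’s actual allocator.

```cpp
// Occupancy model for a single XVE, per the figures in the text.
#include <algorithm>
#include <cstdio>

int xe3Occupancy(int regsPerThread) {
    const int totalRegs = 64 * 1024 / 64;                    // 1024 registers per XVE
    const int block = 32;                                    // allocation granularity
    int allocated = ((regsPerThread + block - 1) / block) * block;
    return std::min(10, totalRegs / allocated);
}

// Prior generations: 128 registers per thread (8 threads), or 256 registers
// in "large GRF" mode (4 threads).
int xe2Occupancy(int regsPerThread) {
    return regsPerThread <= 128 ? 8 : 4;
}

int main() {
    for (int regs : {32, 96, 128, 160, 256})
        std::printf("%3d regs/thread: Xe3 %2d threads, Xe2 %d threads\n",
                    regs, xe3Occupancy(regs), xe2Occupancy(regs));
}
```

At 96 registers per thread Xe3 still fits 10 threads, and beyond that it sheds threads one block at a time instead of halving occupancy.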

Nvidia and AMD GPUs allocate registers at even finer granularity. AMD’s RDNA 2, for example, allocates registers in blocks of 16. But Xe3 is still more flexible than prior Intel generations. With this change, simple shaders that only need a few registers will enjoy better latency tolerance from more thread-level parallelism. And more complex shaders can avoid dropping to the “large GRF” mode.

Xe3’s XVEs have more scoreboard tokens too. Like AMD and Nvidia, Intel uses compiler-aided scheduling for long latency instructions like memory accesses. A long latency instruction can set a scoreboard entry, and a dependent instruction can wait until that entry is cleared. Each Xe3 thread gets 32 scoreboard tokens regardless of occupancy, so an XVE has 320 scoreboard tokens in total. On Xe2, a thread gets 16 tokens if the XVE is running eight threads, or 32 in “large GRF” mode with four threads. Thus Xe2’s XVEs only have 128 scoreboard tokens in total. More tokens let a thread have more outstanding long latency instructions, which likely translates to more memory-level parallelism per thread.
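The sketch below tallies the token budgets described above and, in comments, models how compiler-managed scoreboarding works in general. The pseudo-assembly and names are illustrative; Intel’s actual software scoreboard (SWSB) encoding differs.

```cpp
// Token budget per XVE, per the figures in the text.
#include <cstdio>

struct XveConfig { int threads; int tokensPerThread; };

int totalTokens(XveConfig c) { return c.threads * c.tokensPerThread; }

int main() {
    XveConfig xe3         = {10, 32};   // 32 tokens per thread regardless of occupancy
    XveConfig xe2Normal   = { 8, 16};
    XveConfig xe2LargeGrf = { 4, 32};

    std::printf("Xe3: %d tokens per XVE\n", totalTokens(xe3));          // 320
    std::printf("Xe2 normal: %d, Xe2 large GRF: %d\n",
                totalTokens(xe2Normal), totalTokens(xe2LargeGrf));      // 128, 128

    // Conceptually, within one thread (illustrative syntax, not Intel's):
    //   load r10, [addr]   { set $3 }   // long-latency op tags token 3
    //   ...independent work...
    //   add  r20, r10, r11 { wait $3 }  // consumer waits until token 3 clears
    // More tokens mean more loads can be in flight before the compiler
    // has to reuse a token and insert an artificial wait.
}
```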

Intel’s GPU ISA has a vector register file (GRF, or General Register File) that stores much of a shader program’s data and feeds the vector execution units. It also has an “Architecture Register File” (ARF) with special registers. Some of those can store data, like the accumulator registers. But others serve special purposes. For example, sr0 as mentioned above provides GPU topology info, along with floating point exception state and thread priority. A 32-bit instruction pointer points to the current instruction address, relative to the instruction base address.

Notes on Intel’s ARF, with changes from Xe2 to Xe3 in blue

Xe3 adds a “Scalar Register” (s0) to the ARF6. s0 is laid out much like the address register (a0), and is used for gather-send instructions. XVEs access memory and communicate with other units by sending messages over the Xe Core’s message fabric, using send instructions. Gather-send appears to let Xe3 gather non-contiguous values from the register file and send them with a single send instruction.

Besides adding the Scalar Register, Xe3 extends the thread dependency register (TDR) to handle 10 threads. sr0 gains an extra 32-bit doubleword for unknown reasons.

Xe3 supports a saturation modifier for FCVT, an instruction that converts between different floating point types (not between integer and floating point). FCVT was introduced with Ponte Vecchio, but the saturation modifier could ease conversion from higher to lower precision floating point formats. Xe3 also gains HF8 (half float 8-bit) format support, providing another 8-bit floating point format option next to the BF8 type already supported in Xe2.
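To illustrate what a saturation modifier buys, here is a small sketch of saturating conversion toward an 8-bit float range. It assumes HF8 behaves like an E4M3-style format and BF8 like an E5M2-style format; Intel’s exact definitions and limits may differ, so the constants below are placeholders.

```cpp
// Without saturation, out-of-range inputs would overflow to infinity/NaN
// in the narrow format; with saturation they clamp to the largest finite value.
#include <algorithm>
#include <cstdio>

float saturateTo(float x, float maxFinite) {
    return std::clamp(x, -maxFinite, maxFinite);
}

int main() {
    const float HF8_MAX = 448.0f;    // assumed E4M3-style limit
    const float BF8_MAX = 57344.0f;  // assumed E5M2-style limit

    for (float v : {100.0f, 500.0f, 70000.0f, -1e6f})
        std::printf("%g -> hf8 sat %g, bf8 sat %g\n",
                    v, saturateTo(v, HF8_MAX), saturateTo(v, BF8_MAX));
}
```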

For the XMX unit, Xe3 gains an sdpas instruction4. sdpas stands for sparse systolic dot product with accumulate5. Matrices with a lot of zero elements are known as sparse matrices. Operations on sparse matrices can be optimized because anything multiplied by zero is of course zero. Nvidia and AMD GPUs have both implemented sparsity optimizations, and Intel is apparently looking to do the same.
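As an illustration of why sparsity helps a dot-product-accumulate unit, the sketch below uses the 2:4 structured sparsity scheme Nvidia and AMD employ: each group of four weights stores only two nonzero values plus their positions. Whether sdpas uses this exact scheme isn’t clear from the code changes, so treat this as a sketch of the idea rather than Intel’s format.

```cpp
// Structured-sparsity dot product sketch: skip multiplies that are
// guaranteed to be zero by storing only the nonzero values and their indices.
#include <array>
#include <cstdio>

struct SparseGroup {                  // compressed form of 4 weights
    std::array<float, 2> value;       // the two nonzero values
    std::array<int, 2>   index;       // their positions within the group (0..3)
};

// The dense equivalent would do 4 multiply-adds per group; this does 2.
float sparseDot(const SparseGroup* w, const float* act, int groups) {
    float acc = 0.0f;
    for (int g = 0; g < groups; g++)
        for (int i = 0; i < 2; i++)
            acc += w[g].value[i] * act[g * 4 + w[g].index[i]];
    return acc;
}

int main() {
    SparseGroup w[1] = {{{2.0f, 3.0f}, {1, 3}}};      // weights: 0, 2, 0, 3
    float act[4] = {1.0f, 10.0f, 100.0f, 1000.0f};
    std::printf("dot = %g\n", sparseDot(w, act, 1));  // 2*10 + 3*1000 = 3020
}
```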

Sub-Triangle Opacity Culling (STOC) subdivides triangles in BVH leaf nodes, and labels sub-triangles as transparent, opaque, or partially transparent. The primary motivation is to reduce wasted any-hit shader work when games use texture alpha channels to handle complex geometry. Intel’s paper calls out foliage as an example, noting that programmers may use low vertex counts to reduce “rendering, animation, and even simulation run times.”7 BVH geometry from the API perspective can only be completely transparent or opaque, so games label all partially transparent primitives as transparent. Each ray intersection will fire an any-hit shader, which performs alpha testing. If alpha testing indicates the ray intersected a transparent part of the primitive, the shader program doesn’t contribute a sample and the any-hit shader launch is basically wasted. STOC bits let the any-hit shader skip alpha testing if the ray intersects a completely transparent or completely opaque sub-triangle.
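Conceptually, the STOC bits map each candidate hit to one of three actions, and only the mixed case still pays for an any-hit shader and its alpha test. The enum and function names below are illustrative, not Intel’s or DXR’s API.

```cpp
// Sketch of the per-hit decision STOC enables.
#include <cstdio>

enum class SubTriOpacity { Transparent, Opaque, PartiallyTransparent };

enum class HitAction {
    IgnoreHit,       // fully transparent: no shader needed, keep traversing
    AcceptHit,       // fully opaque: accept without alpha testing
    RunAnyHitShader  // mixed: fall back to the alpha-testing any-hit shader
};

HitAction classify(SubTriOpacity o) {
    switch (o) {
        case SubTriOpacity::Transparent: return HitAction::IgnoreHit;
        case SubTriOpacity::Opaque:      return HitAction::AcceptHit;
        default:                         return HitAction::RunAnyHitShader;
    }
}

int main() {
    std::printf("transparent -> %d, opaque -> %d, partial -> %d\n",
                (int)classify(SubTriOpacity::Transparent),
                (int)classify(SubTriOpacity::Opaque),
                (int)classify(SubTriOpacity::PartiallyTransparent));
}
```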

From Intel’s paper, showing examples of foliage textures

Storing each sub-triangle’s opacity information takes two bits, so STOC does need more storage compared to using a single opacity bit for the entire triangle. Still, it’s far more practical than packing entire textures into the BVH. Intel’s paper found that a software-only STOC implementation increased performance by 5.9-42.2% compared to standard alpha tests when handling translucent ray-traced shadows.

Elden Ring’s BVH represents tree leaves in Leyndell using larger triangles, as seen in Radeon Raytracing Analyzer. STOC may map well to this scenario

STOC-aware raytracing hardware can provide further gains, especially with Intel’s raytracing implementation. Intel’s raytracing acceleration method closely aligns with the DXR 1.0 standard. A raytracing accelerator (RTA) independently handles traversal and launches hit/miss shaders by sending messages to the Xe Core’s thread dispatcher. STOC bits could let the RTA skip shader launches if the ray intersects a completely transparent sub-triangle. For an opaque sub-triangle, the RTA can tell the shader program to skip alpha testing, and end the ray early.

Illustrating the problem STOC tries to solve. Yes, I used rectangles, but I wanted to use the same leaf texture design as Intel’s paper. And Intel’s leaf nodes store rectangles (triangle pairs) anyway

Xe3 brings STOC bits into hardware raytracing data structures with two levels of sophistication. A basic implementation keeps 64B leaf nodes, but creatively finds space to fit 18 extra bits. Intel’s QuadLeaf structure represents a merged pair of triangles. Each triangle gets 8 STOC bits, implying four sub-triangles. Another two bits indicate whether the any-hit shader should do STOC emulation in software, potentially letting programmers turn off hardware STOC for debugging. This mode is named “STOC1” in code.

Sketching out triangle (leaf) node formats for Xe/Xe2 and Xe3. Blue = STOC related, purple = non-STOC raytracing data structure changes

A “STOC3” structure takes things further by storing pointers to STOC bits rather than embedding them into the BVH. That allows more flexibility in how much storage the STOC bits can use. STOC3 also specifies recursion levels for STOC bits, possibly for recursively partitioning triangles. Subdividing further would reduce the number of partially transparent sub-triangles, which need alpha testing from the any-hit shader. Storing pointers for STOC3 brings leaf node size to 128 bytes, increasing BVH memory footprint.
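To make the sizes and bit budgets concrete, here is a very rough sketch of the two leaf node variants. Only the 64 B and 128 B node sizes, the 8 STOC bits per triangle (two bits for each of four sub-triangles), the two control bits, and the recursion level come from the description above; every field name, offset, and padding amount below is a guess.

```cpp
// Hypothetical leaf node layouts, sized to match the description.
#include <cstdint>

struct QuadLeafStoc1 {                 // "STOC1": STOC bits squeezed into spare space
    uint8_t  geometryData[55];         // placeholder for the existing triangle-pair data
    uint8_t  stocBitsTri0;             // 4 sub-triangles x 2 bits
    uint8_t  stocBitsTri1;             // 4 sub-triangles x 2 bits
    uint8_t  control;                  // 2 bits: software STOC emulation flags (rest unused)
    uint8_t  pad[6];
};
static_assert(sizeof(QuadLeafStoc1) == 64, "basic STOC keeps 64 B leaves");

struct QuadLeafStoc3 {                 // "STOC3": STOC bits stored out-of-line
    uint8_t  geometryData[56];         // placeholder for the existing triangle-pair data
    uint64_t stocBitsPointer[2];       // per-triangle pointers to STOC bit arrays
    uint32_t recursionLevel;           // how deeply sub-triangles are subdivided
    uint8_t  pad[52];
};
static_assert(sizeof(QuadLeafStoc3) == 128, "STOC3 grows leaves to 128 B");
```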

Possible performance gains are exciting, but using STOC requires work from game developers or game engines. Intel suggests that STOC bits can be generated offline as part of game asset compilation. Artists will have to decide whether using STOC will provide a performance uplift for a particular scene. A scene with a lot of foliage might benefit massively from STOC. A chain link fence may be another story. STOC isn’t part of the DirectX or Vulkan standards, which can be another obstacle to adoption. However, software-only STOC can still provide advantages. That could encourage developers to try it out. If they do implement it, STOC-aware Xe3 hardware stands to gain more than a software-only solution.

We’re still some time away from real Xe3 products. But software changes suggest Xe3 is another significant step forward for Intel’s graphics architecture. Xe2 was a solid step in Intel’s foray into discrete graphics, providing better performance than Xe with a nominally smaller GPU. Xe3 tweaks the architecture again and likely has similar goals. Higher occupancy and dynamic register allocation would make Xe Cores more latency tolerant, improving utilization. Those changes also bring Intel’s graphics architecture closer to AMD’s and Nvidia’s.

XVE changes show Intel is still busy evolving their core compute architecture. In contrast, Nvidia’s Streaming Multiprocessors haven’t seen significant changes from Ampere to Blackwell. Nvidia may have felt Ampere’s SM architecture was good enough, and turned their efforts to tuning features while scaling up the GPU to keep providing generational gains. Intel meanwhile seeks to get more out of each Xe Core (and Xe2 achieved higher performance than Xe with fewer Xe Cores).

Intel Arc logo on a product box

In a parallel with Nvidia, Intel is pushing hard on the features front and has clearly invested in research. GPUs frequently try to avoid doing wasted work. Just as rasterization pipelines use early depth testing to avoid useless pixel shader invocations, STOC avoids spawning useless any-hit shaders. It’s too early to tell what kind of difference STOC or other Xe3 features will make. But anyone doubting Intel’s commitment to moving their GPU architecture forward should take a serious look at Mesa and Intel Graphics Compiler changes. There’s a lot going on, and I look forward to seeing Xe3 whenever it’s ready.
