The biggest change to Xe3 is that it’s simply bigger, with render slices scaling up to more Xe cores per slice, an increase in L1 cache from 192KB to 256KB, a major increase in L2 cache, and more registers that can be better utilized.
Micro benchmarks show significant improvements in occluded primitives culling for unnecessary triangles when rendering game scenes, in addition to improvements in anisotropic filtering.

Its variable register allocation and register changes also aim to unclog the pipeline so that the hardware can be better utilized, as one of the biggest problems with Arc in its current Xe2 and Battlemage implementation has been that there’s plenty of hardware, but it’s not being used properly. This is a combination of both hardware issues, like with fixed function units in the architecture, and driver issues, which Intel has been slowly addressing. Some of this included moving off of emulation of things like execute indirect to eliminate overhead.
For Xe3, Intel noted to us some of its driver improvements and software control panel focus as well, all of which should benefit the company as it moves toward its eventual Celestial dGPUs.
This accompanies a number of other announcements related to its Panther Lake mobile solutions and laptop hardware, plus some “AI” and NPU hardware.
We’re mostly going to focus on the IP block of Xe3 and the architecture and won’t be as focused on the product side for laptops.
Although this isn’t a dGPU part, it’s likely that this approach will either be directly found in the next dGPU or will at least indicate which direction Intel is going.
Intel was clear that this isn’t exactly Celestial, which is the architecture following in the Alchemist – Battlemage – Celestial – Druid lineup. Intel noted that “Xe3P” will follow Xe3. The “P” unironically stands for “Plus,” showing old Intel habits die hard. Intel didn’t confirm this, but the impression we got is that Xe3P will be the “real” Celestial GPUs, while this Xe3 makes major changes that likely set the stage for it.
Overview of Announcements
Intel had a lot of announcements to share with the press today. For our coverage, we’re focusing almost entirely on the Xe3 changes and micro benchmarks. We’ll cover some of the other news as well, like performance/Watt improvements and XeSS changes, but we’re not going to get into the NPU and AI processing changes today. There’s enough to talk about just with the stuff that’ll affect consumer desktop parts in the future (plus the immediate impact to laptops).
All of this follows the announcement that NVIDIA is investing in Intel to build its own mobile parts with them later, but there’s no news on that topic today. This is all Intel’s hardware.
Naming Confusion
Briefly on the naming: Intel admitted its naming combination of Xe for IP and Alchemist / Battlemage / Celestial / Druid for branding has been confusing. It was careful to note that these parts are not Celestial, and the impression we got was that Intel doesn’t want to burn the name on an incremental improvement prior to a pending major overhaul. Intel is sticking with “Arc B-Series” for the Panther Lake mobile parts, but is moving to the Xe3 architecture. Xe3P will likely be Celestial or desktop parts later.
Xe3 IP GPU Block
Intel specifically mentioned designing Xe3 to scale to larger configuration sizes, which could be good news for anyone who wants to see something higher-end than a B580-class card in the future.
Let’s get into micro benchmarks first, then look at the block diagram.

This is a chart of micro benchmarks, which are workloads designed to target extremely specific functions or behaviors on a product. A 2x improvement here won’t equal a 2x improvement in most real-world applications, but these allow us to see where the improvements are appearing. Intel published these for Xe2 also.
In Xe3 for “depth writes,” Intel says it saw a 7.4x relative performance improvement normalized to clock frequency. We’re not certain, but our understanding is that this isn’t isolated for configuration size. This means that this isn’t a perfect comparison since the Xe core count is different between Xe2 and Xe3 in these tests. This 7.4x improvement outstrips the change in configuration size, though.
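As a rough sanity check on that last point, the sketch below divides the reported uplift by the growth in Xe core count. The 12-core versus 8-core figures are an assumption borrowed from the frame comparison Intel showed elsewhere; Intel did not confirm which configurations were used for the micro benchmarks, and the function name is ours.

```python
def per_core_uplift(reported_uplift, new_cores, old_cores):
    """Divide a reported uplift by the growth in Xe core count.

    This assumes performance scales linearly with core count, which is
    a simplification and not something Intel stated.
    """
    return reported_uplift / (new_cores / old_cores)


# ASSUMPTION: 12 Xe cores (Xe3) vs. 8 Xe cores (Xe2); unconfirmed for this test.
residual = per_core_uplift(7.4, new_cores=12, old_cores=8)
print(f"Uplift not explained by core count alone: {residual:.2f}x")
```

Under those assumptions, most of the 7.4x would still come from something other than the larger configuration.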
We asked Intel what “depth writes” means. The company told us that it’s related to Hi-Z culling and that this bar represents better primitives culling in the pipeline, meaning culling of unseen triangles and geometry sooner in the pipeline so as not to waste resources rendering unseen objects in-game. An example might be if a building is blocking a player: there’s no point rendering the player if it can’t be seen. Culling isn’t new and batching primitives in ways that eliminate occluded primitives has been around forever, but this shows that there’s still plenty of ground to gain here for Intel. It will result in better utilization of resources and allocating them to more productive work. Intel told us that the improvement to this process is disproportionately beneficial, meaning that it should have an impact on gaming performance that would be more noticeable than other improvements. We’d expect this to carry over to future Celestial dGPU parts as well.
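To make the idea concrete, here is a minimal software sketch of hierarchical-Z style culling, using the building-and-player example. It is a generic illustration of the technique and not Intel’s hardware implementation; the tile size, depth values, and function names are all our own assumptions.

```python
def build_hiz(depth, tile):
    """Store the farthest depth per tile (larger value = farther away)."""
    h, w = len(depth), len(depth[0])
    hiz = []
    for ty in range(0, h, tile):
        row = []
        for tx in range(0, w, tile):
            row.append(max(
                depth[y][x]
                for y in range(ty, min(ty + tile, h))
                for x in range(tx, min(tx + tile, w))
            ))
        hiz.append(row)
    return hiz


def is_occluded(hiz, tile, bbox, nearest_depth):
    """Conservatively reject a primitive before any per-pixel work.

    bbox is (x0, y0, x1, y1) in inclusive pixel coordinates. The primitive
    is culled only if its nearest point is at or behind the farthest depth
    already written in every tile it overlaps.
    """
    x0, y0, x1, y1 = bbox
    for ty in range(y0 // tile, y1 // tile + 1):
        for tx in range(x0 // tile, x1 // tile + 1):
            if nearest_depth < hiz[ty][tx]:
                return False  # might be visible in this tile
    return True


# Hypothetical 4x4 depth buffer: a "building" at depth 0.2 covers the left
# half of the screen, with empty sky (depth 1.0) on the right half.
depth = [[0.2, 0.2, 1.0, 1.0] for _ in range(4)]
hiz = build_hiz(depth, tile=2)

behind_building = is_occluded(hiz, 2, (0, 0, 1, 3), nearest_depth=0.5)
in_the_open = is_occluded(hiz, 2, (2, 0, 3, 3), nearest_depth=0.5)
print(behind_building, in_the_open)
```

The player behind the building is rejected with a handful of tile comparisons, while the same player standing in the open passes through to normal rendering.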
The “High Register Pressure Shader” section also saw a large uplift in micro benchmarks at 1.9x to 3.1x. Scattered reads improved by 2.7x on the relative scale, with Intel noting to us that this has to do with using samplers to read data scattered across something like a texture (as opposed to a well-organized data set).
Mesh rendering is also shown here, with Intel telling us that Xe2 had already provided a proof of concept around improving mesh shading. Intel noted that this micro benchmark is representative of workloads where a lot of polygons are present, telling us that the uplift comes from a larger cache and more efficient use of its registers. Culling also contributes.
Quickly, Intel also saw uplift in anisotropic filtering, which is the old function that helps improve smoothness of textures and objects proportionate to the view frustum’s angle. Ray-triangle intersection also improved by 2x in the micro benchmarks on the relative scale, which is noteworthy since Xe2 already benefitted from relatively large ray tracing improvements.
Looking back at the Xe2 micro benchmarks, Intel at the time highlighted Draw XI and Compute Dispatch XI primarily. When it talked to us about that chart, Intel told us that this was due to implementation of native execute indirect support for indirect draw and dispatch, as opposed to its Xe1 emulation of these functions.
Block Diagram

Time to get into block diagrams for how the new Panther Lake Xe3 block is built. This shows a 12 Xe-core configuration as the maximum size announced for mobile, with this configuration carrying 16MB of L2 cache, 2 geometry pipelines, 12 samplers, and 4 pixel backends. The L2 cache is noteworthy here.

This is the new Xe3 render slice. A render slice is Intel’s terminology that defines a block on the GPU containing Xe cores. For reference, the Battlemage B580 with Xe2 has 20 Xe cores on 5 render slices, so each slice is just one part of the total GPU.
The Xe2 slice had 4 Xe cores, with Xe3 moving to 6 Xe cores per render slice. Intel also intends to scale up the configuration size on mobile devices to a maximum of 12 Xe cores (2 render slices, up from 8 Xe cores on a prior 2-slice configuration).
The Xe3 render slice shows that each Xe core has 8 vector engines, which is unchanged from Xe2 cores; however, Intel is increasing the cache size in Xe3. Intel’s Tom Petersen stated, “The first thing we’ve done is increase the size of our L2. By increasing the size of the L2 from 8MB to 16MB, we reduced the traffic that hits the memory interface. That’s important because the memory interface is typically one of the most precious resources on a graphics chip. We can see anywhere between 17% and 36% traffic reduction heading towards memory, which has a significant performance effect on these different applications.”
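The slice math above is easy to tabulate. This short sketch multiplies slices by Xe cores per slice and by the 8 vector engines per core quoted above; the dictionary labels and helper name are ours, and the figures are only the ones stated in this section.

```python
VECTOR_ENGINES_PER_CORE = 8  # stated as unchanged between Xe2 and Xe3

# (render slices, Xe cores per slice)
configs = {
    "Xe2 mobile, 2 slices": (2, 4),
    "Xe3 mobile, 2 slices": (2, 6),
    "B580 (Xe2), 5 slices": (5, 4),
}


def totals(slices, cores_per_slice):
    """Return (total Xe cores, total vector engines) for a configuration."""
    cores = slices * cores_per_slice
    return cores, cores * VECTOR_ENGINES_PER_CORE


for name, (slices, cores_per_slice) in configs.items():
    cores, engines = totals(slices, cores_per_slice)
    print(f"{name}: {cores} Xe cores, {engines} vector engines")
```

At the same slice count, the mobile part goes from 8 to 12 Xe cores, a 1.5x increase in configuration size before any per-core changes are counted.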

Looking at Intel’s first-party results, it presents the improvement in the form of relative traffic on the SoC fabric (on the vertical axis) against a baseline 8MB L2 cache. Cyberpunk with RT showed a 19% reduction, Black Myth rasterized showed a 36% reduction, and the rasterized Steel Nomad test showed a 17% reduction.
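One way to read those percentages: if the same memory interface has to carry less traffic, it has proportionally more headroom. The sketch below converts Intel’s first-party reduction figures into remaining traffic and a headroom multiplier. The headroom framing is our own simplification that assumes the workload is purely limited by memory traffic, which real games are not.

```python
# Intel's first-party traffic reductions versus an 8MB L2 baseline.
reductions = {
    "Cyberpunk 2077 (RT)": 0.19,
    "Black Myth (raster)": 0.36,
    "Steel Nomad (raster)": 0.17,
}


def remaining_traffic(reduction):
    """Fraction of baseline memory traffic that still reaches memory."""
    return 1.0 - reduction


def headroom(reduction):
    """How much more work the same interface could serve, in the idealized
    case where memory traffic is the only limit (an assumption)."""
    return 1.0 / remaining_traffic(reduction)


for name, r in reductions.items():
    print(f"{name}: {remaining_traffic(r):.0%} of baseline traffic, "
          f"{headroom(r):.2f}x idealized headroom")
```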

Intel also told us that it has increased its L1 cache by 33%, noting a move from 192KB to 256KB. When we asked Tom Petersen which area of uplift he thought had the most impact on overall performance, he pointed us toward the register and thread changes. Intel has increased thread count upwards of 25% depending on configuration and has moved to variable register allocation. Petersen noted that occupancy of the compute units (including on Battlemage) previously wasn’t always high, despite them being available for work, meaning that there was more GPU hardware present than was being properly utilized by applications. Intel has focused on this in both drivers and hardware. He noted that previous register allocation and thread count choices would “starve the pipeline if the shader used too many registers,” which is being addressed.
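To illustrate why variable register allocation helps occupancy, here is a toy model comparing a fixed scheme against a variable one. Every number in it is a hypothetical placeholder: the register file size, the two fixed allocation modes, and the allocation granularity are our assumptions for illustration and were not provided by Intel. Only the 10-thread ceiling comes from Intel’s slides.

```python
import math

REGISTER_FILE = 1024  # HYPOTHETICAL registers available per vector engine


def threads_fixed(regs_needed, modes=(128, 256)):
    """Fixed scheme: every thread gets the smallest mode that fits.

    The two modes are hypothetical; a shader needing one register more
    than the small mode is bumped to the large one.
    """
    for mode in sorted(modes):
        if regs_needed <= mode:
            return REGISTER_FILE // mode
    raise ValueError("shader needs more registers than any mode provides")


def threads_variable(regs_needed, granularity=32, max_threads=10):
    """Variable scheme: allocate only what the shader needs, rounded up
    to a hypothetical granularity, capped at a maximum thread count."""
    alloc = math.ceil(regs_needed / granularity) * granularity
    return min(max_threads, REGISTER_FILE // alloc)


for regs in (96, 160, 256):
    print(f"{regs} registers: fixed={threads_fixed(regs)} threads, "
          f"variable={threads_variable(regs)} threads")
```

In this model a shader needing 160 registers keeps only 4 threads resident under the fixed scheme but 6 under the variable one, which is the kind of starvation Petersen described. The real hardware limits will differ.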

The ray tracing unit also got improvements. Intel says it “slowed down dispatches of new rays while the sorting unit catches up,” citing out-of-order dispatch and triangle testing. The ray tracing unit improvements seem to be largely attributed to asynchronous dispatch-test processes.

Intel also highlighted a new URB manager as part of its fixed function improvements, which is also where we find the anisotropic filtering uplift. Petersen stated, “We also now have a new URB manager, which allows partial updates as opposed to flushing the whole thing. Our URB is a structure where we pass results between our units inside our GPU. It was somewhat of a serializing point; now we can actually use that partially without flushing every advanced.”
Frame Inspection

We thought these next couple slides were pretty interesting as well:

Intel showed a frame on Xe3 versus Xe2. These are not normalized for configuration size: the slide shows a 12-core vs. 8-core configuration, which prevents a perfect like-for-like inspection. This is iso-frequency and iso-power, so it’s at least normalized there.
The horizontal axis is for API call execution, with the vertical axis being milliseconds of time to execute during a single frame being drawn (higher is worse). This is for Cyberpunk 2077.
Of note, Intel shows an 8ms reduction on Xe3 with the compute and pixel shader section toward the end, assigning some of that uplift to the change to variable registers and the L1 cache size increase. We can also see that, according to Intel, the L2 benefits the render base pass with a 0.39ms improvement, preceded by the move to 10 threads (and variable registers) providing a 2.93ms improvement in the pre-pass.
More broadly, Petersen told us in a call that the register allocation and number of threads previously would starve the pipeline if the shader used too many registers, which is being partially addressed here. He said that the previous architecture could cause a reduction in the utilization of available compute resources due to regular flushing of the pipeline caused by regular reallocation into memory.
This image is pretty cool and is a look at what actually happens in a frame when it’s being drawn. We have a full video talking about this from previously.
Power Delivery

Intel’s focus on power delivery and power management cites learnings from the MSI Claw (read our review) devices and mostly comes in the form of ensuring proper resource allocation for power budget between the CPU and GPU, which should benefit laptop and handheld devices that have a limited power budget split between the two.

Intel noted that previously, a lack of application awareness meant that the device could sometimes divert too much power to the CPU, leaving the GPU bottlenecked on its power limit while the CPU provided a level of performance that the GPU couldn’t keep up with.

Intel gave the MSI Claw as an example of a time this didn’t go well.
The company noted that it improved on this earlier in the year with its Intelligent Bias Control v2 and is now introducing a v3 to build upon that.

Because the system was previously unaware of the application being run, in this case a game, Intel said that software and hardware wouldn’t correctly balance the workload between the CPU and GPU, resulting in stuttering due to the GPU being power starved.
“Intelligent Bias Control v2” took GPU heuristics and utilization metrics to then inform thread scheduling and resource assignment at the operating system level. Intel had previously marketed improvements to 1% and 0.1% low metrics via better frame interval pacing as a result of this change.
The new v3 version of this adds E-core first scheduling, which is self-explanatory in that E-cores get scheduled first when gaming. This sounds worse, and often would be, but Intel says that the end result is reduced power diversion to the CPU by using lower power cores prior to P-cores, freeing up more of the shared total power budget to go toward the GPU instead. In GPU-bound scenarios, like many games, particularly on handheld devices, this is a better outcome than burning power on a component that isn’t as stressed.
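A toy model shows why E-core first scheduling can leave more power for the GPU. All wattages, core counts, and thread counts below are hypothetical placeholders we picked for illustration; Intel did not give per-core power figures, and real scheduling and power management are far more dynamic than this.

```python
def cpu_power(threads, p_cores, e_cores, p_watts, e_watts, e_core_first):
    """Estimate CPU power when game threads fill one core type first.

    Each busy core is modeled as drawing a flat wattage (an assumption).
    """
    if e_core_first:
        order = [e_watts] * e_cores + [p_watts] * p_cores
    else:
        order = [p_watts] * p_cores + [e_watts] * e_cores
    return sum(order[:threads])


def gpu_budget(total_watts, cpu_watts):
    """Whatever the CPU doesn't use of the shared budget goes to the GPU."""
    return max(total_watts - cpu_watts, 0.0)


# HYPOTHETICAL handheld: 30W shared budget, 4 P-cores at 4W, 8 E-cores at 1.5W,
# and a game keeping 6 threads busy.
common = dict(threads=6, p_cores=4, e_cores=8, p_watts=4.0, e_watts=1.5)
p_first = cpu_power(e_core_first=False, **common)
e_first = cpu_power(e_core_first=True, **common)

print(f"P-core first: CPU {p_first}W, GPU gets {gpu_budget(30.0, p_first)}W")
print(f"E-core first: CPU {e_first}W, GPU gets {gpu_budget(30.0, e_first)}W")
```

The model ignores the CPU performance lost by avoiding P-cores, which is exactly the trade-off that makes this a win only in GPU-bound scenarios.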

This comparison between Panther Lake and the prior generation of this bias control solution shows that peaks in power usage have smoothed out while the GPU power consumption has leveled to be more predictable. Reminder: These are first-party tests. The GPU is also getting more of the total power budget as a percentage than previously, while reducing CPU power in exchange. For GPU-bound scenarios especially, this should be a better outcome. It might help in some CPU-bound scenarios as well.
XeSS Multi-Frame Generation and Other Changes
Intel also announced XeSS 3, which includes XeSS-Multi-Frame Generation (or XeSS-MFG). A few more letters and they’ll have the whole alphabet.
XeSS-MFG is conceptually similar to NVIDIA’s MFG. XeSS-MFG takes 2 real frames to calculate optical flow networks using motion vectors and the depth buffer, then uses that information to generate up to 3 frames between the 2 real frames. The frames are then displayed in order and paced in a way to minimize animation error. We also have a separate deep-dive on our new animation error testing methodology.

The new “XeSS Frame Generation Override” setting in the driver software allows the user to set 2x, 3x, or 4x mode.
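As a sketch of what those modes imply, the code below computes where generated frames would sit between two real frames and an evenly paced display schedule. This is a simplified model of interpolation-based frame generation in general, not Intel’s actual pacing algorithm; the function names and the even-spacing assumption are ours.

```python
def generated_positions(mode):
    """Fractional positions of generated frames between two real frames.

    mode is the multiplier (2x, 3x, or 4x); a 4x mode yields 3 generated
    frames for every real frame.
    """
    if mode not in (2, 3, 4):
        raise ValueError("mode must be 2, 3, or 4")
    return [i / mode for i in range(1, mode)]


def display_schedule(t_prev_ms, t_next_ms, mode):
    """Evenly spaced display times for one real frame followed by its
    generated frames (ASSUMPTION: ideal pacing, no added hold latency)."""
    interval = (t_next_ms - t_prev_ms) / mode
    return [t_prev_ms + i * interval for i in range(mode)]


for mode in (2, 3, 4):
    print(f"{mode}x: generated at {generated_positions(mode)}")

# Two real frames 40ms apart in 4x mode: one real frame, then 3 generated.
print(display_schedule(0.0, 40.0, 4))
```

Note that both real frames must already exist before any in-between frame can be generated, so the earlier real frame is necessarily held back in practice.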

Intel provided several timelines of a single frame: one at native, and then several with various levels of XeSS technology in use. The shorter the duration of the frame on the X-axis, the less time the frame took to complete. The top half of each shows instructions and the bottom half shows when the geometry pipeline is active.
Compared to native, the raster, RT, and denoise sections of the frame are shorter on the XeSS 3 timeline due to rendering at a lower resolution. The first purple section represents XeSS-SR performing the upscaling. The second purple block starts with the optical flow portion of frame gen, followed by 3 frame generation operations.

It seems like Intel’s argument is that the entire frame gen process takes less time than drawing one real frame, and is therefore better or something, but this entirely ignores image quality. We’ve shown with both AMD FMF and NVIDIA MFG that the image quality sacrifice isn’t always worth it. Sometimes it is, but it’s not always that simple. Intel stated that these frames upscaled with XeSS-SR are the same quality as native, which is unlikely. Intel stated: “That frame is as good as the prior picture, the native frame. But it’s actually being run quicker.” We doubt this will be broadly true and will evaluate later on dGPUs. It was bullshit when NVIDIA claimed it, too. The quality can be good, but isn’t as good.
Intel had some other side-by-sides that we take issue with, and that along with still having watermarks on the video means we’ll skip them and just test it ourselves later.

Intel referred to the frame gen process as looking into the future. NVIDIA CEO Jensen Huang has said similar things about NVIDIA’s frame generation. Both of them are wrong, because all current methods of frame generation rely entirely on finished frames and engine data. These frames already existed and could have been displayed instead of holding them to run the frame generation in between. That isn’t looking into the future, that’s interpolating between two sequential snapshots of the present or near present. Until a predictive method of frame generation comes out, none of these technologies look into or generate “the future”; they at best interpolate the past. And that’s fine, but we’d really like it if these companies could get their shit together and stop saying that they generate the future.

MFG represented on benchmark charts has been a major and ongoing controversy and misrepresentation of performance on NVIDIA’s side of things. Intel committed to relying on base raster performance without frame generation as the baseline for performance and said that, when it publishes numbers including upscaling or frame gen, these will be presented as supplemental to the base metric. We think this is a better balance of marketing the capability without entirely misrepresenting the reality.
Intel also talked about a new version of PresentMon that includes a few changes, partly accounting for frame generation technology.


