At Computex 2024, Intel launched a new GPU architecture, Xe2 (which Intel insists on writing with a superscript e, and we don’t). Xe2 features in the new Lunar Lake platform and will be the basis for both forthcoming Battlemage discrete GPUs and the GPUs (integrated and discrete) accompanying future Arrow Lake processors.
The all-new GPU design, code-named Battlemage, combines two new elements: Xe2 GPU cores for graphics and Xe Matrix Extension (XMX) arrays for AI—what other firms are calling an NPU or TPU.
Intel says the Xe2 GPU cores improve gaming and graphics performance by 1.5× over the previous generation, while the new XMX arrays act as a second AI accelerator, delivering up to 67 TOPS of throughput for AI content creation.
Xe2 debuts as part of the Lunar Lake platform. Lunar Lake’s all-new architecture will enable:
- New performance cores (P-cores) and efficiency cores (E-cores) designed to deliver performance and energy efficiency improvements.
- A fourth-generation Intel neural processing unit (NPU) with a claimed peak of 48 tera-operations per second (TOPS) of AI performance. Intel says that is up to 4× the AI compute of the previous generation, enabling corresponding improvements in generative AI.
- An advanced low-power island, a novel compute cluster that handles background and productivity tasks with extreme efficiency, which Intel says enables long laptop battery life.
Lunar Lake’s microarchitecture consists of two tiles connected through Intel’s Foveros packaging technology, which also includes memory on the package.
A new compute tile contains Intel’s latest-generation efficiency cores (E-cores) and performance cores (P-cores), both of which introduce new microarchitecture enhancements focused on x86 efficiency. This time, the aim is to compete with Arm-based designs like Apple M4 and Snapdragon X Elite.
The compute tile also houses a neural processing unit (NPU 4) and image processing unit (IPU), as well as the new Xe2 graphics processing unit (GPU).
In Lunar Lake, there is also a new microarchitecture for the display and media engines.
The Xe2 unified architecture will appear first in Lunar Lake and then with Battlemage discrete GPUs, which Intel seems more confident about than Alchemist. These evolutions of Xe are designed to deliver higher utilization, improved work distribution, and less software overhead.
Like everything about Lunar Lake, the focus of Xe2 is on efficiency, and indirect execute has been implemented in hardware (providing a 12.5× improvement when it is used, Intel claims). But game compatibility has not been forgotten, with Intel proudly pointing out that the just-released F1 24 Champions Edition from EA was working on day one. XeSS sits above the software driver and uses deep learning to do the supersampling in the impressive F1 24 demos shown to the press.
Xe2 is a scalable, modular design built from what Intel calls render slices. The second-gen Xe core is the computational block and is now natively SIMD-16, where the last generation was SIMD-8. This, in part, is where the improved compatibility with games comes from.
Each Xe core has eight 512-bit vector engines, eight 2,048-bit XMX engines, 64-bit atomic ops support, and a 192KB shared L1 cache.
The new vector engine has SIMD-16 native ALUs with support for SIMD-16 and SIMD-32 ops; XMX support for int2, int4, int8, FP16, and BF16; extended math and FP64; and three-way co-issue for integer, transcendental, and floating point in the same clock.
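The lane counts follow from the widths quoted above. Here is a minimal sketch of that arithmetic, assuming each lane holds a 32-bit value (our assumption, not something Intel stated) and using a deliberately simplified model in which a wider op is issued as several native-width passes. The function names are ours, for illustration only.

```python
# Lane arithmetic for an Xe2 vector engine, using figures quoted in the article.
VECTOR_ENGINE_BITS = 512   # from the article: 512-bit vector engines
ENGINES_PER_XE_CORE = 8    # from the article: eight vector engines
LANE_BITS = 32             # ASSUMPTION: one SIMD lane holds a 32-bit value


def lanes_per_engine(engine_bits=VECTOR_ENGINE_BITS, lane_bits=LANE_BITS):
    """Number of lanes that fit in one vector engine."""
    return engine_bits // lane_bits


def passes_needed(op_width, native_width):
    """Simplified model: native-width passes needed for an op of op_width lanes."""
    return -(-op_width // native_width)  # ceiling division


native_width = lanes_per_engine()                    # 512 / 32 = 16 lanes
lanes_per_core = native_width * ENGINES_PER_XE_CORE  # 16 * 8 = 128 lanes

print("lanes per engine:", native_width)
print("lanes per Xe core:", lanes_per_core)
print("passes for a SIMD-32 op on SIMD-16 ALUs:", passes_needed(32, native_width))
print("passes for a SIMD-32 op on SIMD-8 ALUs:", passes_needed(32, 8))
```

Under these assumptions, a 512-bit engine matches the SIMD-16 native width exactly, and a SIMD-32 op takes half as many passes as it would on SIMD-8 hardware.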
Intel quotes two headline XMX throughput figures: int8 at 4,096 ops/clock and FP16 at 2,048 ops/clock.
The render slice is supported by fixed-function units such as a ray-tracing unit, sampler, geometry engine, and rasterizer.
New to Xe2 is out-of-order sampling with compressed textures and 2× throughput for sampling without filtering. Hi-Z has also been redone, with early Hi-Z culling of small primitives. (Hierarchical depth, also known as Hi-Z, is a technique that comes up often in graphics. It’s used to accelerate occlusion culling on the CPU and the GPU, screen-space reflections, screen-space ambient occlusion, volumetric fog, and more.)
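For readers who have not met Hi-Z before, the general idea can be sketched in a few lines. This is the textbook technique, not Intel’s hardware implementation: the function names, the toy 4×4 depth buffer, and the "larger depth means farther away" convention are all our assumptions.

```python
def build_hiz(depth):
    """Build a max-depth pyramid from a square, power-of-two depth buffer.

    ASSUMPTION: larger depth values are farther from the camera, so each
    coarse texel stores the farthest depth of the 2x2 block beneath it.
    """
    levels = [depth]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        n = len(prev) // 2
        coarser = [
            [max(prev[2 * y][2 * x], prev[2 * y][2 * x + 1],
                 prev[2 * y + 1][2 * x], prev[2 * y + 1][2 * x + 1])
             for x in range(n)]
            for y in range(n)
        ]
        levels.append(coarser)
    return levels


def is_occluded(levels, level, x, y, nearest_depth):
    """Conservative test: a primitive is hidden if even its nearest point lies
    behind the farthest depth already stored in the coarse texel covering it."""
    return nearest_depth > levels[level][y][x]


# Toy 4x4 depth buffer (hypothetical values).
depth = [
    [0.20, 0.30, 0.90, 0.90],
    [0.20, 0.25, 0.90, 0.90],
    [0.40, 0.40, 0.50, 0.60],
    [0.40, 0.40, 0.50, 0.55],
]
hiz = build_hiz(depth)

# A small primitive at depth 0.35 is culled where nearer geometry already
# covers the whole 2x2 block, and kept where it might still be visible.
print(is_occluded(hiz, 1, 0, 0, 0.35))
print(is_occluded(hiz, 1, 1, 0, 0.35))
```

The appeal for small primitives is that one comparison against a coarse level can reject them before any per-pixel depth testing happens.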
On the pixel back end, Intel says the new 8N compression and fast clear for sub-resources dramatically improve performance.
Intel has dug in on ray tracing this time around, which it recognizes as a fundamental algorithm in next-gen games. Like Nvidia’s approach, it is built around a bounding volume hierarchy (BVH) traversal pipeline, with three traversal pipelines, 18 box intersections, and two triangle intersections per ray-tracing unit (RTU).
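A box intersection is the basic test a BVH traversal repeats at every node. As a rough illustration, here is the standard "slab" ray-versus-box test in software. It is a generic sketch of the technique, not a description of how Intel’s RTU implements it, and the function name and example values are ours.

```python
def ray_hits_box(origin, direction, box_min, box_max):
    """Slab test: does a ray (origin + t * direction, t >= 0) hit an
    axis-aligned bounding box? All arguments are 3-element sequences."""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray is parallel to this pair of planes: it must start between them.
            if o < lo or o > hi:
                return False
            continue
        t0 = (lo - o) / d
        t1 = (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        # Shrink the interval of t values that lie inside every slab so far.
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
        if t_near > t_far:
            return False
    return True


box_min, box_max = (2.0, -1.0, -1.0), (3.0, 1.0, 1.0)
print(ray_hits_box((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), box_min, box_max))   # toward the box
print(ray_hits_box((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), box_min, box_max))   # parallel, misses
print(ray_hits_box((0.0, 0.0, 0.0), (-1.0, 0.0, 0.0), box_min, box_max))  # pointing away
```

A traversal pipeline runs tests like this against the child boxes of each node and descends only into the ones the ray actually enters, leaving the costlier triangle tests for the leaves.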
Xe2 is all about enabling scalability for Intel, down into Lunar Lake and up into Battlemage discrete GPUs. The Lunar Lake configuration has eight Xe cores. Intel says it’s all about delivering experiences—low power, high performance, and the latest industry standards.
Xe2 is one of three big blocks for graphics in Lunar Lake, the others being a media engine and display block.
In Lunar Lake, the GPU delivers 1.5× the performance of Meteor Lake at the same power and, for AI, the XMX engines peak at 67 int8 TOPS.
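The 67 TOPS figure can be sanity-checked against the other numbers in this piece. The sketch below assumes the 4,096 int8 ops/clock figure is per Xe core and that peak TOPS is simply cores × ops per clock × clock speed. Both are our assumptions, and the clock it prints is an inferred value, not one Intel quoted to us.

```python
# Back-of-envelope check on the XMX peak, using figures quoted in the article.
PEAK_INT8_TOPS = 67          # from the article
XE_CORES = 8                 # from the article (Lunar Lake configuration)
INT8_OPS_PER_CLOCK = 4096    # from the article; ASSUMPTION: per Xe core
FP16_OPS_PER_CLOCK = 2048    # from the article; ASSUMPTION: per Xe core


def implied_clock_ghz(peak_tops, cores, ops_per_clock):
    """Clock (GHz) needed to reach peak_tops if peak = cores * ops/clock * clock."""
    return peak_tops * 1e12 / (cores * ops_per_clock) / 1e9


def peak_tops(cores, ops_per_clock, clock_ghz):
    """Peak tera-ops per second for a given core count and clock."""
    return cores * ops_per_clock * clock_ghz * 1e9 / 1e12


clock = implied_clock_ghz(PEAK_INT8_TOPS, XE_CORES, INT8_OPS_PER_CLOCK)
fp16_peak = peak_tops(XE_CORES, FP16_OPS_PER_CLOCK, clock)

print(f"implied GPU clock: {clock:.2f} GHz")
print(f"FP16 peak at that clock: {fp16_peak:.1f} tera-ops/s")
```

Under those assumptions the numbers hang together: eight Xe cores need a clock of a little over 2 GHz to reach 67 int8 TOPS, and FP16 lands at half that rate.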
The display engine has three display pipes for up to 8K60 HDR or up to 3× 4K60 HDR. It has HDMI 2.1, DisplayPort 2.1, and eDP 1.5 output. The front half of the display engine is a frame buffer with a pixel processing pipeline of six planes per pipeline with media color conversion, plane scaling, and composition all in hardware. The back half is a display pipe delivering HDR perceptual quantization across all outputs.
What makes it laptop-specific, since that’s Lunar Lake’s main form factor? Panel replay and a brightness sensor with LACE (Local Adaptive Contrast Enhancement). There is also 3:1 visually lossless compression and stream encoding for HDMI or DisplayPort.
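To see why 3:1 compression matters at these resolutions, here is a rough estimate of the raw pixel data rates involved. It assumes 10-bit RGB (30 bits per pixel) for HDR and ignores blanking intervals and link-encoding overhead, so treat the results as ballpark figures of our own rather than anything from Intel’s spec sheet.

```python
BITS_PER_PIXEL_HDR = 30   # ASSUMPTION: 10 bits per channel, RGB, no chroma subsampling
COMPRESSION_RATIO = 3     # from the article: 3:1 visually lossless compression


def raw_gbps(width, height, refresh_hz, bits_per_pixel=BITS_PER_PIXEL_HDR):
    """Uncompressed active-pixel data rate in Gbit/s (blanking ignored)."""
    return width * height * refresh_hz * bits_per_pixel / 1e9


one_8k60 = raw_gbps(7680, 4320, 60)
three_4k60 = 3 * raw_gbps(3840, 2160, 60)

print(f"1x 8K60 HDR: {one_8k60:.1f} Gbit/s raw, "
      f"{one_8k60 / COMPRESSION_RATIO:.1f} Gbit/s at 3:1")
print(f"3x 4K60 HDR: {three_4k60:.1f} Gbit/s raw, "
      f"{three_4k60 / COMPRESSION_RATIO:.1f} Gbit/s at 3:1")
```

On these assumptions a single 8K60 HDR stream is close to 60 Gbit/s before compression, which makes it clear why the display engine leans on compression rather than raw link bandwidth.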
eDP 1.5 (Embedded DisplayPort 1.5) is an evolution of panel self-refresh with adaptive sync. The killer app for this is that it adaptively sets the panel to 48 Hz for movie playback, which removes judder. There’s panel self-refresh too, whereby a buffer in the panel holds local data to reduce update bandwidth requirements. The buffer can cover a screen region of arbitrary size. On top of that, it can send multiple frames at the same time, with early transport for reduced latency and smoother, more responsive visuals.
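The judder point is easy to show with a little arithmetic. The sketch below assumes the movie is standard 24 fps film content (our assumption; the frame rate was not stated) and counts how many panel refreshes each film frame stays on screen. The function name is ours.

```python
import math
from fractions import Fraction


def refresh_cadence(content_fps, panel_hz, frames=6):
    """For each content frame, count the panel refreshes it is displayed for."""
    cadence = []
    for i in range(frames):
        # Frame i is on screen from refresh index `start` up to (not including) `end`.
        start = Fraction(i * panel_hz, content_fps)
        end = Fraction((i + 1) * panel_hz, content_fps)
        cadence.append(math.ceil(end) - math.ceil(start))
    return cadence


print("24 fps on a 60 Hz panel:", refresh_cadence(24, 60))
print("24 fps on a 48 Hz panel:", refresh_cadence(24, 48))
```

At 60 Hz the frames alternate between three and two refreshes, so on-screen time is uneven and motion stutters. At 48 Hz every frame gets exactly two refreshes, which is why dropping the panel to 48 Hz for film removes the judder.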
The media engine now has an 8MB side cache, 8K60 10-bit HDR decode and encode, and VVC support, which reduces bit rate at the same quality. Intel says VVC delivers a 10% reduction in size versus AV1. VVC preserves data sent when streaming bandwidth changes, reducing picture-quality degradation, and is apparently better for screen content coding in uses such as remote desktop and game streaming.
We agree that since normal codecs were designed for movies, not desktop streaming, and VVC was designed in the streaming era, VVC is a Very Vital Codec. (That’s not what it really stands for; it’s Versatile Video Coding.)
The software stack has a media interface, including the Intel Video Processing Library, and Intel says years of work on the 3D side have led to dramatic improvements.
On the side of the stack is XeSS, which sits above the driver to handle the deep learning-based supersampling. At 1080p with ray-traced shadows, it is happily within the Lunar Lake power budget, Intel says.
Will it play games?
Lunar Lake is “all about mainstream experiences, and you can also play games,” says Intel Fellow Tom Petersen. That’s not a trivial statement, since Intel’s earlier Xe generation, Alchemist, was plagued by bugs and driver fixes that Petersen admits were a “sad state at the time.”
With Xe2, Intel has made a tremendous number of changes to comply with what Petersen calls “the dominant expectation.” That means not just being DX-compliant but also behaving like Nvidia’s software architecture, so that developers need to make fewer changes. This pragmatism is shared by the whole of Lunar Lake, which strikes us as one of Intel’s best-considered and best-delivered products in some time.