AMD possible ray-tracing improvements

AMD has trailed Nvidia in ray-tracing performance ever since Nvidia’s Turing (2018) and Ampere (2020) architectures, which outperformed AMD’s RDNA 2 (2020) and RDNA 3 (2022) architectures. However, alleged upcoming improvements in RDNA 4 GPUs could enhance AMD’s ray-tracing performance, potentially providing a competitive edge. This advancement may also benefit the Sony PS5 Pro as the console market heats up, with Microsoft and Sony competing for superior performance and capabilities.

***AMD’s prototype RDNA 4 test board with ray-tracing improvements. (Source: Jon Peddie)***

An alleged specifications list of ray-tracing features to be included in AMD’s forthcoming RDNA 4 GPU design has circulated through the Web’s echo chamber. The features have been listed and relisted, but with no explanation or benefit analysis.

RDNA 4 will have, according to the asserted leak, twice as many RT intersect engines. Twice anything in a processor (other than latency or power consumption) is a good thing. But what is it?

A ray-tracing intersect engine handles the calculations required to determine where rays intersect with objects in a scene, enabling realistic rendering of light and shadows—its primary function is to accelerate ray-intersection calculations.

The intersect engine is a specialized component within a GPU (or a dedicated ray-tracing unit, RTU) that performs the complex calculations necessary to determine the points at which rays intersect with objects in a 3D scene. This process is crucial for generating realistic images by accurately simulating the behavior of light, including reflections, refractions, and shadows. The engine works by tracing the path of rays from the camera through the scene, calculating intersections with geometric objects, and then using these intersections to compute the color and intensity of pixels based on the material properties and lighting conditions. This technology is used in applications such as real-time rendering in video games, simulations, and visual effects.

It determines which rays hit specific objects, calculates intersection points and distances, and feeds data to shading engines for further processing.
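To make the intersection step concrete, here is a minimal software sketch of one ray-triangle test using the well-known Möller-Trumbore method. It only illustrates the kind of arithmetic an intersect engine accelerates; it is not a description of AMD's hardware, and the function and helper names are our own.

```python
EPSILON = 1e-9

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_triangle_intersect(origin, direction, v0, v1, v2):
    """Return (t, u, v) for a hit, or None for a miss.

    t is the distance along the ray; u and v are the barycentric
    weights of v1 and v2 (the weight of v0 is 1 - u - v).
    """
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    pvec = cross(direction, edge2)
    det = dot(edge1, pvec)
    if abs(det) < EPSILON:
        return None  # ray is parallel to the triangle's plane
    inv_det = 1.0 / det
    tvec = sub(origin, v0)
    u = dot(tvec, pvec) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    qvec = cross(tvec, edge1)
    v = dot(direction, qvec) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(edge2, qvec) * inv_det
    if t < EPSILON:
        return None  # intersection lies behind the ray origin
    return t, u, v
```

The returned distance and barycentric weights are the data a shading stage would typically consume next.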

AMD’s RDNA 4 will supposedly have twice as many ray-tracing intersect engines as the RDNA 3 architecture has. As such, it is expected to greatly improve the speed and accuracy of AMD’s ray-tracing capabilities.

The information from hardware leaker @Kepler_L2 also indicated the RDNA 4 design would have an RT instance node transform capability. A ray-tracing instance node transform process updates the positions of vertices in real time to animate objects, such as zombies, by feeding these positions into the scene hierarchy generator to assemble the scene acceleration structure.

Ray-tracing instance node transform is responsible for updating the positions of vertices in real time to animate objects within a scene. For example, in the demonstration by Imagination Technologies using the Unity 5 engine, the engine performs dynamic skinning to animate zombies using a vertex shader. The updated vertex positions are then fed into the scene hierarchy generator, which assembles the scene acceleration structure in real time. This process is crucial for handling dynamic geometry in ray-traced scenes, ensuring that the animated objects are accurately represented in the ray-traced environment. Imagination Technologies’ GR6500 GPU dedicated ray-tracing unit has such capabilities.
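The flow described above (update vertex positions, then hand them to whatever builds the acceleration structure) can be sketched in a few lines. This is a simplified illustration under our own assumptions: a single rigid rotation stands in for per-bone skinning, and recomputing one bounding box stands in for the scene hierarchy generator. All names are hypothetical.

```python
import math

def rotate_y(vertex, angle):
    """Rotate one vertex around the y axis (stand-in for skinning)."""
    x, y, z = vertex
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def animate(vertices, angle):
    """Produce this frame's vertex positions."""
    return [rotate_y(v, angle) for v in vertices]

def compute_aabb(vertices):
    """Recompute the bounds that an acceleration structure would store."""
    xs, ys, zs = zip(*vertices)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

# Per frame: move the vertices, then refresh the bounds used for traversal.
rest_pose = [(1.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
frame_vertices = animate(rest_pose, math.pi / 2)
frame_bounds = compute_aabb(frame_vertices)
```

The point of the sketch is the ordering: bounds must be refreshed after every vertex update, or rays will be tested against stale geometry.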

The RDNA 4 is also purported to offer a 64-byte ray-tracing node. In the context of ray tracing, a 64-byte ray-tracing node typically refers to a data structure used within a bounding volume hierarchy (BVH) or other spatial acceleration structures to efficiently manage and traverse the scene geometry. The node stores essential information for intersection tests and hierarchy traversal, all packed into 64 bytes for optimal performance. Here’s a breakdown of what might be included in such a node:

Bounding volumes:
- AABB (Axis-aligned bounding box)—The node may store the minimum and maximum bounds of the bounding volume, typically requiring six floats (three for the minimum and three for the maximum coordinates).
- This could take up 24 bytes (six floats at 4 bytes each).
Child pointers:
- Child indices or pointers—Depending on whether it’s an inner node or a leaf node, it will have pointers or indices to its children. For an inner node with two children, you might have two 32-bit integers or two 64-bit pointers.
- This could take up 8 bytes (two 4-byte indices) or 16 bytes (two 8-byte pointers).
Primitive information:
- For leaf nodes, the node might store indices or references to the primitives (triangles, spheres, etc.) it contains.
- This could take up 8 bytes for storing two 32-bit indices.
Node type and metadata:
- Flags or metadata—A few bytes are typically reserved for flags that denote the type of node (inner or leaf) and other metadata necessary for traversal.
- This could take up 4 bytes.
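As a rough illustration of the breakdown above, the fields can be packed into exactly 64 bytes with Python's standard `struct` module. The layout is our assumption, not AMD's actual node format: with 4-byte indices the listed fields add up to 44 bytes, so this sketch simply pads the remaining 20 bytes.

```python
import struct

# Hypothetical layout, little-endian with no implicit alignment:
#   6 floats  (AABB min/max)          24 bytes
#   2 uint32  (child indices)          8 bytes
#   2 uint32  (primitive indices)      8 bytes
#   1 uint32  (flags / node type)      4 bytes
#   20 pad bytes                      20 bytes
NODE_FORMAT = "<6f2I2II20x"
NODE_SIZE = struct.calcsize(NODE_FORMAT)

def pack_node(bounds_min, bounds_max, children, primitives, flags):
    """Serialize one node into a fixed 64-byte record."""
    return struct.pack(NODE_FORMAT, *bounds_min, *bounds_max,
                       *children, *primitives, flags)

def unpack_node(data):
    """Decode a 64-byte record back into named fields."""
    f = struct.unpack(NODE_FORMAT, data)
    return {
        "bounds_min": f[0:3],
        "bounds_max": f[3:6],
        "children": f[6:8],
        "primitives": f[8:10],
        "flags": f[10],
    }
```

A fixed 64-byte record is attractive because it matches a common cache-line size, so one memory fetch can bring in a whole node. Real hardware would likely use the spare bytes for more children or additional data rather than padding.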

We also learned from @Kepler_L2 that RDNA 4 could have ray-tracing tri-pair optimization. That’s not something that gets mentioned very often. Ray-tracing tri-pair optimization refers to techniques that improve the performance of ray-triangle intersection tests by leveraging specific properties or arrangements of triangle pairs. These optimizations aim to reduce the computational cost and improve the efficiency of intersection tests, which are a core part of ray-tracing algorithms.

In ray tracing, a significant amount of computational effort goes into testing whether rays intersect with the geometry in the scene. Since triangles are the most common primitives in 3D graphics, optimizing ray-triangle intersection tests can yield substantial performance gains. The key concepts behind tri-pair optimization are:

Bounding volume hierarchies (BVHs) with triangle pairs:
- Grouping triangles—By grouping pairs of triangles together and using a common bounding volume for the pair, we can reduce the number of intersection tests required.
- Shared bounding volumes—Instead of testing each triangle individually, the algorithm first checks if the ray intersects the bounding volume of the pair. If it does, only then does it proceed to test individual triangles within that volume.
Coherent memory access:
- Spatial locality—Storing triangle pairs contiguously in memory can improve cache performance. When a ray intersects a bounding volume, the subsequent access to the pair’s triangles is more likely to be cache-efficient.
- Reduced traversal steps—By processing pairs, the traversal steps in the BVH are reduced, leading to fewer memory accesses and faster processing.
Efficient data structures:
- Optimized BVH nodes—Nodes in the BVH can be optimized to handle triangle pairs specifically. These nodes might store information that allows quick rejection of non-intersecting rays or efficient intersection computation.
- Packed data—Triangles can be packed in a way that takes advantage of vectorized operations (SIMD instructions), further speeding up intersection tests.

Benefits of tri-pair optimization are:

Reduced intersection tests:
- By grouping triangles and using shared bounding volumes, the number of intersection tests is reduced, as the ray may only need to be tested against a single bounding volume initially.
Improved cache efficiency:
- Grouping triangles and storing them contiguously improves spatial locality, leading to better use of on-chip caches and faster memory access.
Faster traversal:
- Optimized BVH traversal with triangle pairs can lead to fewer nodes being visited, reducing the overall computational load.
Better vectorization:
- Using vectorized operations for triangle pairs allows multiple intersection tests to be performed simultaneously, leveraging the SIMD hardware of modern processors.

Practical implementations can be:

BVH construction:
- During BVH construction, triangles are grouped into pairs based on spatial proximity or other criteria to optimize the bounding volume sizes and shapes.
Intersection testing:
- The ray-BVH traversal algorithm is modified to handle nodes containing triangle pairs. Intersection tests are optimized to first check the bounding volume and then the individual triangles if necessary.
Memory layout:
- Triangles are stored in a memory layout that favors cache efficiency and vectorized operations. This may involve interleaving data or using specific data structures that facilitate fast access and computation.
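The ideas listed above (a shared bounding volume for two triangles, tested first, with the triangles stored together) can be sketched in software as follows. This is an illustrative guess at the general technique, not the leaked RDNA 4 design; `make_pair` and `intersect_pair` are hypothetical names.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def hit_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test; returns hit distance t or None."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None
    inv = 1.0 / det
    tv = sub(origin, v0)
    u = dot(tv, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(tv, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t >= eps else None

def hit_aabb(origin, direction, box_min, box_max):
    """Slab test: does the ray touch the axis-aligned box at all?"""
    t_near, t_far = 0.0, float("inf")
    for axis in range(3):
        o, d = origin[axis], direction[axis]
        if abs(d) < 1e-12:
            if o < box_min[axis] or o > box_max[axis]:
                return False
            continue
        t0, t1 = (box_min[axis] - o) / d, (box_max[axis] - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True

def make_pair(tri_a, tri_b):
    """Store two triangles together under one shared bounding box."""
    verts = tri_a + tri_b
    box_min = tuple(min(v[i] for v in verts) for i in range(3))
    box_max = tuple(max(v[i] for v in verts) for i in range(3))
    return {"box_min": box_min, "box_max": box_max, "triangles": (tri_a, tri_b)}

def intersect_pair(origin, direction, pair):
    """Return (t, triangle_index) of the nearest hit, or None."""
    if not hit_aabb(origin, direction, pair["box_min"], pair["box_max"]):
        return None  # one box test rejects both triangles
    best = None
    for index, tri in enumerate(pair["triangles"]):
        t = hit_triangle(origin, direction, *tri)
        if t is not None and (best is None or t < best[0]):
            best = (t, index)
    return best
```

Two triangles that form a quad are the natural candidates for pairing, since they share an edge and their combined bounding box is barely larger than either one alone.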

If implemented, ray-tracing tri-pair optimization is a big deal because it enhances the performance of ray tracing by reducing the number of intersection tests, improving cache efficiency and enabling better use of vectorized operations. By grouping triangles and optimizing data structures, this approach streamlines the process of determining ray-geometry intersections, resulting in faster and more efficient rendering.

We also learned from the reported leak that RDNA 4 would change flags encoded in barycentrics to simplify detection of procedural nodes. In ray tracing, efficiently managing and detecting procedural nodes is crucial for performance and flexibility, especially when dealing with complex scenes that include both traditional geometric primitives (like triangles) and procedurally generated content. One technique to facilitate this is to use barycentric coordinates to encode flags, simplifying the detection and handling of procedural nodes. Let’s break this down.

Barycentric coordinates are a coordinate system used in the context of triangles. For any point within a triangle, its position can be uniquely described by three coordinates, which represent the point’s relative position to the triangle’s vertices. These coordinates are often used in ray-triangle intersection tests.

Barycentric coordinates have three components, typically denoted as u, v, and w (with u + v + w = 1). These components can be used for more than just locating a point within a triangle; they can also encode additional information.

Procedural nodes refer to parts of a scene that are generated algorithmically rather than being explicitly defined by traditional geometric data. These can include procedural textures, complex geometric shapes generated by fractals, or other algorithmic methods.

By embedding flags within barycentric coordinates, one can efficiently signal the presence of procedural nodes during ray tracing. Here’s how it works: Certain value ranges of the coordinates are reserved as flags. When a ray intersects the triangle and the barycentric coordinates of the intersection point are calculated, the coordinates are checked against these reserved ranges to determine if procedural handling is required.
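One way to picture the idea is the sketch below. It is purely hypothetical: the leak does not say how RDNA 4 encodes its flags, so the sentinel value and function names are our own assumptions. The sketch relies only on the fact that a genuine triangle hit has u and v between 0 and 1 with u + v no greater than 1, which leaves every other value free to carry a flag.

```python
# Hypothetical sentinel: any negative u can never come from a real triangle hit.
PROCEDURAL_U = -1.0

def encode_hit(u, v, is_procedural):
    """Return the (u, v) pair a traversal unit would report for this hit."""
    if is_procedural:
        return (PROCEDURAL_U, 0.0)  # flag replaces the real barycentrics
    return (u, v)

def classify_hit(u, v):
    """Decide how to handle a reported hit by inspecting only (u, v)."""
    if u < 0.0:
        return "procedural"   # hand off to a procedural intersection routine
    if v >= 0.0 and u + v <= 1.0:
        return "triangle"     # ordinary triangle hit, shade as usual
    return "invalid"
```

The attraction of such a scheme is that the detection costs a single comparison on data the intersection test already produces, with no extra field in the hit record.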

The benefits are efficiency and performance. Embedding flags within barycentric coordinates avoids the need for separate data structures or complex condition checks, streamlining the ray-triangle intersection process.

Performance is improved by reducing the overhead of detecting procedural nodes, and the ray tracer can maintain high performance even in complex scenes.

Using barycentric coordinates to encode flags is an elegant and efficient method to simplify the detection and handling of procedural nodes in ray tracing. This approach leverages the existing calculation of barycentric coordinates during ray-triangle intersection tests to embed additional information, enabling quick and seamless integration of procedural content within the ray-tracing pipeline.

Included in the alleged new characteristics of the RDNA 4 would be BVH footprint improvement. A BVH is a tree structure where each node represents a bounding volume that encloses a set of primitives (e.g., triangles). Internal nodes contain bounding volumes that encompass their child nodes, and leaf nodes contain actual geometric primitives. The footprint of a BVH refers to the amount of memory it occupies and how well it fits into the CPU/GPU cache. A smaller footprint generally means better cache coherence and less memory bandwidth usage, leading to faster traversal and intersection tests. A more compact representation for BVH nodes is used to reduce memory usage. This can involve encoding the BVH nodes in a way that minimizes the amount of memory needed to store them. For example, instead of storing child pointers explicitly, one could use indices that can be packed more densely.
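The pointer-versus-index example is easy to quantify. The sketch below compares two hypothetical node layouts using Python's `struct` module; neither is claimed to match any real GPU format.

```python
import struct

# Both layouts hold a 6-float AABB plus two child references.
POINTER_NODE = "<6f2Q"  # two 64-bit pointers
INDEX_NODE = "<6f2I"    # two 32-bit indices into a node array

def footprint_bytes(node_format, node_count):
    """Total memory for node_count nodes of the given layout."""
    return struct.calcsize(node_format) * node_count

pointer_total = footprint_bytes(POINTER_NODE, 1_000_000)  # 40 bytes per node
index_total = footprint_bytes(INDEX_NODE, 1_000_000)      # 32 bytes per node
savings = 1.0 - index_total / pointer_total                # 20% smaller
```

In this toy case, switching to indices shrinks a one-million-node BVH from 40 MB to 32 MB, which means more of the tree stays resident in cache during traversal.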

And lastly, we discovered from the leak that there might be RT support for OBB and instance node intersection detection in RDNA 4. Ray-tracing support for oriented bounding boxes (OBBs) and instance node intersection detection are advanced techniques used to improve the accuracy and performance of ray-tracing algorithms.

Unlike axis-aligned bounding boxes (AABBs), which are aligned with the coordinate axes, OBBs can be rotated to fit more tightly around the object they enclose. This often results in a smaller volume and fewer false positives during intersection tests.

In ray tracing, detecting intersections with OBBs is more complex than with AABBs due to their arbitrary orientation, and specialized algorithms are used to test if a ray intersects an OBB. However, OBBs provide a tighter fit around objects, reducing the number of ray-primitive intersection tests. This can significantly improve performance, especially for complex or elongated objects.
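A common software approach to the ray-OBB test is to run the usual slab test along the box's own axes instead of the world axes. The sketch below assumes the OBB is described by a center, three orthonormal axes, and half-extents; that representation and the function name are our assumptions, not details from the leak.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def ray_obb_intersect(origin, direction, center, axes, half_extents):
    """Return the entry distance along the ray, or None for a miss.

    axes must be three mutually perpendicular unit vectors.
    """
    t_near, t_far = 0.0, float("inf")
    delta = sub(center, origin)
    for axis, half in zip(axes, half_extents):
        e = dot(axis, delta)      # box center along this axis, seen from the ray origin
        f = dot(axis, direction)  # ray direction along this axis
        if abs(f) > 1e-12:
            t0, t1 = (e - half) / f, (e + half) / f
            if t0 > t1:
                t0, t1 = t1, t0
            t_near, t_far = max(t_near, t0), min(t_far, t1)
            if t_near > t_far:
                return None
        elif -e - half > 0.0 or -e + half < 0.0:
            return None  # ray runs parallel to this slab and starts outside it
    return t_near
```

Compared with an AABB test, the extra cost is the dot products that project the ray onto the box axes, which is the kind of work dedicated hardware support would absorb. For a box rotated 45 degrees about the z axis, a ray passing near a corner of the enclosing AABB misses the OBB, which illustrates the tighter fit.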

Using OBBs reduces the number of intersection tests by providing tighter bounds. Instance nodes reduce memory usage and allow efficient transformations, which improves overall rendering performance. OBBs can more accurately represent the bounds of rotated or skewed objects, leading to more precise intersection tests. Instance nodes allow for the accurate placement and transformation of repeated geometries in the scene.
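The memory saving from instance nodes comes from storing geometry once and giving each instance only a transform. A ray is then moved into the instance's local space before being tested against the shared geometry. The sketch below assumes a deliberately simple instance (translation plus uniform scale) and uses a unit sphere as the shared geometry; both choices and all names are ours, for illustration only.

```python
import math

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def hit_unit_sphere(origin, direction):
    """Shared geometry: unit sphere at the local origin.

    direction must be normalized. Returns local hit distance or None.
    """
    b = dot(origin, direction)
    c = dot(origin, origin) - 1.0
    disc = b * b - c
    if disc < 0.0:
        return None
    t = -b - math.sqrt(disc)
    return t if t >= 0.0 else None

def intersect_instance(origin, direction, instance):
    """Transform the ray into the instance's space, test, and map t back."""
    scale = instance["scale"]
    local_origin = tuple((o - t) / scale
                         for o, t in zip(origin, instance["translation"]))
    # Uniform scale leaves a normalized direction unchanged.
    t_local = hit_unit_sphere(local_origin, direction)
    return None if t_local is None else t_local * scale

# Two instances share one piece of geometry; each stores only its transform.
instances = [
    {"translation": (0.0, 0.0, 10.0), "scale": 2.0},
    {"translation": (5.0, 0.0, 10.0), "scale": 1.0},
]
hits = [intersect_instance((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), inst)
        for inst in instances]
```

Here the first instance is hit at a distance of 8 (a radius-2 sphere centered at z = 10) and the second is missed. General instance transforms use a full matrix, which is where hardware acceleration of the ray transform would matter.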

RT support for oriented bounding boxes and instance node intersection detection are crucial for improving the efficiency and accuracy of ray-tracing algorithms. OBBs provide tighter bounds around objects, reducing the number of intersection tests and improving performance. Instance nodes allow for efficient memory usage and accurate transformations of repeated geometries in a scene. Combined, these techniques enable the rendering of complex scenes with higher performance and accuracy, making them essential for advanced ray-tracing applications.

So, if all these extremely esoteric deep architectural improvements are made to AMD’s RDNA 4 GPU, and are, as has been suggested, incorporated in the custom APU for Sony’s next-gen PlayStation, AMD can boast of some significant improvements in RT processing speeds and, therefore, FPS at 4K.

But how many people would recognize that from just the list that was flashed all over the Web?

