Trends and Forecasts in Computer Graphics— power-efficient rendering

This is the second in a series of observations about computer graphics (CG) trends, opportunities, and challenges.

(the first installment can be found at http://jonpeddie.com/blogs/comments/trends-and-forecasts-in-computer-graphics/ )

Introduction

Today’s mobile processors are improving at an astonishing rate – and consequently delivering visually stunning user experiences in less than 3 Watts of power consumption.

Just how is it that mobile processors, running at 5% of the power of a game console, can produce the kind of graphics realism we had previously come to expect only from game consoles?

It’s all about the pixel and making it shine, without drinking a lot of power. Let’s look at a few of the techniques being used in modern SoC application processors to achieve these fantastic results.

Creating realism within the constraints of mobile devices

Computer graphics (CG) features like hardware tessellation, geometry shading, and high dynamic range (HDR) are developed on big PCs and workstations. Once proven, there’s a natural rush to migrate these techniques to the mobile platforms. The mobile platforms are getting larger screens, more powerful processors, and high-bandwidth capabilities, all the elements needed for advanced CG. However, all of those things and others come with a cost—power consumption. So the challenge is how to get stunning real-time workstation-class graphics in a battery-powered device that can run for 8 or more hours and not become too hot to hold.

We’re not going to have PC laptop levels of battery capacity, or terabytes per second of memory bandwidth, or active cooling in any reasonable, lightweight mobile devices anytime soon, if ever. Given those constraints, how do we increase the image quality and realism to provide a more enjoyable experience?

The first step in any power management system, whether it’s a phone, your home, or a PC, is to turn things off that aren’t being used. In the case of digital processors (CPUs, GPUs, DSPs, etc.) you can lower the voltage to them, turn off unused chip components, and slow down the clock, the heartbeat, of the device.

Power management

Other than a breakthrough in battery technology, there are two ways to manage the balance between power consumption and performance/features: through the hardware and/or software. Modern smartphone and tablet processors like Apple’s A7 and Qualcomm’s Snapdragon have sophisticated DCVS (dynamic clock and voltage scaling) algorithms that help save power. These algorithms run in the background like a robot, checking for workloads, screen and sensor usage, and other activities. Whenever a function or feature isn’t being used, the circuits associated with it are either turned down or off. That’s smart power management, and soon all devices will have it. Intel’s new Mayfield processor has a very advanced power management processor in it.
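To make the clock and voltage scaling idea concrete, here is a minimal sketch of a governor that picks the slowest operating point able to cover a workload. The operating points and the function names are invented for illustration and are not taken from any real SoC; the only physical assumption is the usual rule of thumb that dynamic power scales roughly with voltage squared times frequency.

```python
# Hypothetical operating points: (clock in MHz, core voltage in volts).
# These numbers are illustrative, not taken from any real SoC.
OPERATING_POINTS = [(200, 0.70), (400, 0.80), (600, 0.90), (800, 1.00)]


def relative_dynamic_power(freq_mhz, volts):
    """Dynamic CMOS power scales roughly with V^2 * f (capacitance omitted)."""
    return volts * volts * freq_mhz


def pick_operating_point(required_mhz):
    """Return the slowest operating point that still covers the workload."""
    for freq, volts in OPERATING_POINTS:
        if freq >= required_mhz:
            return freq, volts
    return OPERATING_POINTS[-1]  # saturate at the fastest point


def power_saving(required_mhz):
    """Fraction of dynamic power saved versus always running flat out."""
    top = relative_dynamic_power(*OPERATING_POINTS[-1])
    chosen = relative_dynamic_power(*pick_operating_point(required_mhz))
    return 1.0 - chosen / top
```

With these made-up numbers, a workload that needs 350 MHz runs at the 400 MHz point, and because the voltage drops along with the clock, the saving is larger than the clock reduction alone would suggest.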

With rudimentary power management under control, the developers of the modern SoCs used in mobile devices looked into algorithmic opportunities to deliver realistic images. The semiconductor suppliers, working in their research labs and with game developers, movie studios, and university computer scientists, have come up with clever techniques over the past two to three years. These developments have been employed in many platforms, from workstations to PCs, game machines, and of course mobile devices like tablets and smartphones.

Tiling, chunking, and tile based deferred rendering (TBDR)

Back in 1991, a handful of computer scientists at Microsoft Research launched the Talisman project to investigate new techniques to improve rendering time while scaling screen resolution and color depth. The results were so promising that by 1996, when they publicly announced the project, Microsoft had over 100 people working on it. But the hardware design proved too challenging for the semiconductor process technology available at the time. However, several ideas from the project were successfully developed, albeit not quickly. Among the first techniques to emerge were tiling and deferred rendering, which were originally developed in the Pixel Planes project at the University of North Carolina (UNC). Microsoft built on that knowledge for Talisman. Software renderers, like Reyes at Pixar, also used tiled methods before Microsoft.

Tiling can be thought of like MPEG, where the image is broken up into small sections and updated only when the image within them changes. In addition, tiling uses a tiled z (depth) buffer to determine if a portion of a polygon is visible, and if not, it’s not rendered. The process is similar to a traditional 3D graphics pipeline, but reduced to the size of a small tile. Tiling can also take advantage of early Z buffering (buffering early in the pipeline, after scan conversion, but before shading), to reduce work further.

Figure 1: Z-depth rejection rendering example (Qualcomm)
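The binning step that tiling relies on can be sketched in a few lines. This is a CPU-side Python model rather than GPU code, and the tile size, the function name, and the use of bounding boxes instead of exact triangle coverage are all simplifying assumptions.

```python
TILE = 16  # tile edge in pixels (illustrative; real tile sizes vary by GPU)


def bin_primitives(bboxes, width, height, tile=TILE):
    """Map each tile (tx, ty) to the indices of primitives whose
    screen-space bounding box (x0, y0, x1, y1) overlaps it.
    Bounds are inclusive pixel coordinates."""
    tiles_x = (width + tile - 1) // tile
    tiles_y = (height + tile - 1) // tile
    bins = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}
    for index, (x0, y0, x1, y1) in enumerate(bboxes):
        # Clamp to the screen, then convert pixel bounds to tile bounds.
        tx0, ty0 = max(x0, 0) // tile, max(y0, 0) // tile
        tx1 = min(x1, width - 1) // tile
        ty1 = min(y1, height - 1) // tile
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return bins
```

Once every primitive is binned, each tile can be processed on its own with a depth buffer small enough to stay on chip, and tiles that no primitive touches need no work at all.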

One of the first companies to successfully exploit this technique was Imagination Technologies in 1998 with their PowerVR design that was used in the ill-fated Sega Dreamcast. Since then it has been used in many low power graphics processors, most notably those in the Apple iPhone and iPad. Identifying the portions of each frame that can be ignored in very early stages of rendering is an excellent way to reduce the work done by the GPU.

Tile based Deferred rendering

In 2008 the concept of tile based deferred rendering (TBDR) in mobile devices was introduced by Imagination Technologies with their SGX5 GPU design. As the name implies, tile based deferred rendering postpones processing of some of the pixel properties (shading, texturing, blending) to maintain acceptable interactive performance. This is an implementation of the generic concept of hidden surface removal, also known as visible surface determination (VSD). The TBDR algorithms only process and display those parts of the scene that are more visually important.

The term “deferred” as used here can mean two entirely different things: either avoiding unnecessary work (TBDR), or separating the geometry pass from the lighting/shading pass, e.g. “deferred shading,” used in modern game engines.

So in deferred rendering, rasterization is postponed until all of the polygons have been supplied. In immediate mode rendering, the polygons are rasterized as soon as they arrive, regardless of where they are on the screen. Traditional desktop PC or game console GPUs only support direct rendering mode, and it is one of the reasons why they consume more power.
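The difference in shading work between the two modes can be illustrated with a small counting model. This is a sketch under simplifying assumptions: geometry is opaque, smaller depth values are closer to the viewer, and the fragment lists and function names are made up for the example.

```python
def shaded_immediate(fragments):
    """Shade each fragment as it arrives if it passes the depth test.
    fragments is a list of (pixel, depth) in submission order."""
    depth = {}
    shaded = 0
    for pixel, z in fragments:
        if z < depth.get(pixel, float("inf")):  # smaller z = closer
            depth[pixel] = z
            shaded += 1
    return shaded


def shaded_deferred(fragments):
    """Resolve visibility for all fragments first, then shade survivors."""
    nearest = {}
    for pixel, z in fragments:
        if z < nearest.get(pixel, float("inf")):
            nearest[pixel] = z
    return len(nearest)  # one shading invocation per covered pixel
```

When polygons arrive back to front, the immediate model shades every layer of overdraw, while the deferred model shades each covered pixel once. When they arrive front to back the two counts match, which is why draw order matters so much more on immediate mode hardware.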

Immediate mode rendering

Immediate mode, or direct rendering mode bypasses the internal tile buffers (if present for the GPU hardware in question) and writes pixels out to the frame buffer in system memory immediately, without any batching or any other overhead that is inherent to tile based rendering. Direct mode rendering can be more power efficient with frames that either have minimal or no depth complexity (i.e. single layer), or for scenes that require lots of mid-frame updates or small partial updates, etc.

When Qualcomm introduced its Adreno 320 GPU in the Snapdragon 600 in October 2012, it was the first SoC able to switch dynamically between the two graphics rendering modes, either at the request of the application or based on a heuristic analysis of the rendered scene, which provided incremental power savings and better rendering performance.

Other deferred techniques

There are various techniques for deferring an operation, with the goal of reducing power consumption. The best deferred rendering algorithms display those parts of the scene that are more important visually than others. For example, deferred polygon rendering displays foreground or larger polygons first; deferred shading, described next, is another such technique.

Deferred shading

I wrote about some of the new features in OpenGL ES 3.0 (GLES3) in the last installment (see link at the beginning). In addition to exposing new graphics features, the latest version of OpenGL for embedded systems (the “ES” in OpenGL ES) also has features for improving power efficiency.

GLES3 also introduced its own form of deferred rendering, wherein a two-pass model is employed. The first pass gathers the data required for shading computations, such as positions, normals, and materials, which is then written into the geometry buffer as a series of textures. In the second pass, a pixel shader computes the direct and indirect lighting at each pixel in screen space, using the information in those textures. This gives the ability to render many lights in a scene at higher frame rates with only one geometry pass, and results in lower power consumption. Qualcomm has made great use of this feature in the Adreno GPU design.
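The two-pass structure can be sketched as follows. This is a CPU-side Python model, not shader code: the dictionary standing in for the geometry buffer, the simple Lambert term, and the function names are assumptions made for illustration.

```python
def geometry_pass(fragments):
    """Pass 1: keep depth, normal and albedo of the nearest fragment per pixel.
    Each fragment is (pixel, depth, normal, albedo)."""
    gbuffer = {}
    for pixel, z, normal, albedo in fragments:
        if pixel not in gbuffer or z < gbuffer[pixel][0]:
            gbuffer[pixel] = (z, normal, albedo)
    return gbuffer


def lighting_pass(gbuffer, lights):
    """Pass 2: accumulate simple Lambert lighting per pixel in screen space.
    Each light is (unit direction towards the light, intensity)."""
    image = {}
    for pixel, (_, normal, albedo) in gbuffer.items():
        total = 0.0
        for direction, intensity in lights:
            n_dot_l = sum(n * d for n, d in zip(normal, direction))
            total += max(n_dot_l, 0.0) * intensity
        image[pixel] = albedo * total
    return image
```

The lighting loop runs once per covered pixel no matter how many fragments were submitted for it, so adding lights multiplies the cost by the number of pixels rather than by the amount of geometry.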

Transform Feedback—deferred data

One of the best ways to save battery power is to not repeat operations. A new feature found in GLES3 is Transform Feedback, which is a process of capturing and saving the data from the primitives into buffer objects so that they can be reused for other primitives.

Transform Feedback (which ARM calls XFB) allows the application to capture vertex shader outputs to a buffer, and then read it back to the CPU or use it as a vertex buffer in another draw call. This is a much more power efficient way of handling the vertex processing than rereading data from host memory for every primitive every time, and it is another example of how smart our personal smart devices have become.

Transform feedback support is mandatory in GLES3, so all OpenGL ES 3.0 qualified (certified) devices have it. A list of ES conformant products can be found here: http://www.khronos.org/conformance/adopters/conformant-products#opengles

Some people see transform feedback as ‘compute shader lite’ – its functionality and applications are pretty similar, but a lot less flexible. The main thing it gives you that compute shaders do not, is access to any specialized vertex-pulling hardware, and the ability to capture the vertex data simultaneously without redrawing it. Being explicitly stream-oriented, it may also allow the driver or HW to do some optimization.

However, transform feedback is here today, whereas compute is probably over a year away from sufficient market adoption in shipping mobile devices. Many SoC suppliers are therefore recommending that developers use transform feedback today rather than ignore it.

The more common significant use cases are:

Particle effects – Here you would only use it if you have a complex particle simulation that can be greatly simplified by moving it to an iterative process. So imagine you have a shader that takes hundreds of instructions to calculate the final particle position for each frame.

Figure 2: A particle system used to simulate a fire, created in 3dengfx

Sometimes you can greatly simplify this by moving to an iterative algorithm where it’s much simpler to calculate the particle’s position if you know its position, velocity and acceleration from the previous frame. Transform feedback lets you save all the info on a per-frame basis. You do need to be careful about using excess bandwidth or you could lose the benefit.

Complex vertex skinning that is used for multiple frames – The transformed vertices can be saved and read in. This can sometimes be both a bandwidth and compute savings. Here again, developers are advised to not use this method if they are only using the skinned vertices once. It has to be used in a case where one can reuse the skinned vertices multiple times to be a win.

In summary, these are only optimizations for when something in the scene is reused several times, but these are new technologies that will become commonly used and they will save power when doing so.
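The iterative particle update described above can be sketched as a ping-pong between two buffers. This is a plain Python model of the data flow only: the explicit Euler step, the one-dimensional state, and the function names are illustrative assumptions, and no OpenGL ES calls are involved.

```python
def step(state, dt, gravity):
    """One iterative update: new state from the previous frame's state.
    state is a list of (position, velocity) pairs along one axis."""
    return [(p + v * dt, v + gravity * dt) for p, v in state]


def simulate(initial, frames, dt, gravity=-9.8):
    """Ping-pong between two buffers, as a transform feedback loop would:
    the output captured in one frame is the input of the next."""
    read_buf = list(initial)
    for _ in range(frames):
        write_buf = step(read_buf, dt, gravity)  # "capture" the outputs
        read_buf = write_buf                     # swap: output becomes input
    return read_buf
```

Each frame does a few multiply-adds per particle instead of re-evaluating the whole trajectory from the start, which is where the saving comes from as long as the captured buffer stays small.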

Deferred drawing instructions

In a somewhat SIMD-like (single instruction, multiple data) operation, the geometry “instancing” feature found in GLES3 allows developers to render multiple instances of the same object, or copies with modified attributes such as color or transforms, with a single draw call. This reduces the number of draw calls per frame and moves more of the work from the CPU to the GPU, delivering higher performance and lower power consumption. Compare that with the old OpenGL ES 2.0 API, where you had to draw each geometry instance with a separate draw call. Each draw call comes with considerable CPU overhead in both performance and power. Geometry instancing allows a much more power-efficient implementation of the same scene. Qualcomm was one of the first to implement this function in Snapdragon.
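The effect on draw call count is easy to model. The DrawRecorder class below is a hypothetical stand-in for a graphics context that only counts submissions; in OpenGL ES 3.0 itself the instanced entry points are glDrawArraysInstanced and glDrawElementsInstanced.

```python
class DrawRecorder:
    """Hypothetical stand-in for a GL context that only counts submissions."""

    def __init__(self):
        self.draw_calls = 0
        self.instances_drawn = 0

    def draw(self, mesh, transform):
        # ES 2.0 style: one call per copy of the mesh.
        self.draw_calls += 1
        self.instances_drawn += 1

    def draw_instanced(self, mesh, transforms):
        # ES 3.0 style: one call, per-instance attributes in an array.
        self.draw_calls += 1
        self.instances_drawn += len(transforms)


def render_without_instancing(mesh, transforms):
    gl = DrawRecorder()
    for transform in transforms:
        gl.draw(mesh, transform)
    return gl


def render_with_instancing(mesh, transforms):
    gl = DrawRecorder()
    gl.draw_instanced(mesh, transforms)
    return gl
```

For 500 copies of a mesh, the first path issues 500 draw calls and the second issues one, while both draw the same 500 instances.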

Squeezing textures

Think of the asphalt surface of a road in a driving game, or the brushed metallic background seen in some icons of modern graphical user interfaces (GUIs). These are graphics “textures,” and they consume the biggest chunk of the static and dynamic memory utilization within a graphics or gaming application. They also directly contribute to the application’s power consumption, because every byte of memory that needs to be read by the GPU from host memory and processed into final pixels consumes power. I spoke about GLES3’s new compression format, ETC2 (Ericsson Texture Compression), in the installment referenced at the beginning. OpenGL ES 3.0 brings in the newer and better, royalty-free ETC2 texture compression format (with alpha support), and all hardware suppliers (IHVs) are required to support it.

Figure 3: Texture compression (Khronos)

This makes it simple for developers, as going forward they do not have to compress their textures separately for each device.

Apart from these standard texture compression formats, most GPUs also have a native texture compression format that is optimized for that particular GPU. For example, Adreno GPUs include ATC texture compression as their natively supported format. Conversely, Nvidia doesn’t think the latest open standard Khronos “ETC2” texture compression format in GLES3 will be quickly adopted, and so is only offering their Tegra 4 processor with older, proprietary texture compression formats such as DXT.
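A little arithmetic shows how much memory traffic compression can remove. The bit rates below follow from the ETC2 block layout (a 64-bit block per 4x4 pixels for RGB, a 128-bit block when full alpha is added); the table keys and the function name are just labels chosen for this sketch.

```python
BITS_PER_PIXEL = {
    "RGB8 uncompressed": 24,
    "RGBA8 uncompressed": 32,
    "ETC2 RGB8": 4,         # 64-bit block per 4x4 pixels
    "ETC2 RGBA8 (EAC)": 8,  # 128-bit block per 4x4 pixels
}


def texture_bytes(width, height, fmt):
    """Size of one mip level; block formats round up to whole 4x4 blocks."""
    bpp = BITS_PER_PIXEL[fmt]
    if fmt.startswith("ETC2"):
        width = (width + 3) // 4 * 4
        height = (height + 3) // 4 * 4
    return width * height * bpp // 8
```

A 1024 x 1024 texture with alpha drops from 4 MiB uncompressed to 1 MiB in ETC2, and every byte that is not stored is a byte the GPU never has to fetch.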

OpenSubdiv

One of the hot topics at the 2013 Siggraph was Open Subdivision of surfaces. Although the concept has been around for some time, it’s been proprietary and only available on very high end systems. Pixar, the developer of the concept, first showed it in a demo video called Geri's Game back in 1997 at Siggraph (and it won an Academy Award for Best Animated Short Film in 1998). In 2013 Pixar submitted it to the industry in an open form.

Figure 4: Pixar's Geri's Game, circa 1997

When you see Geri, the first thing you notice is the quality of his skin. It's smooth and malleable, as if made of soft clay. The flexible skin is a direct result of subdivision surfaces, developed by Pixar. With this technology, Geri's whole face and head were created with one surface, one skin.

Normally, we think of subdivision as a high-end modeling tool for non-real-time applications, i.e. graphics scenes that are rendered offline, requiring hours rather than fractions of a second per frame, but it has now come to mobile as well. Pixar has released the open source version as OpenSubdiv, and Motorola together with Qualcomm have optimized that code for OpenCL (the open standard compute language from Khronos) running on the GPU, first implemented in Motorola’s X phone.

The algorithm could be run on the CPU (as it was in the past on workstations), but its parallel nature is well suited to a GPU, and the workload is well balanced with conventional rendering on the GPU. The benefit of OpenSubdiv is that tens of millions of triangles per object are no longer required. All you need is a sparse geometric mesh, which the OpenSubdiv algorithms refine into finely detailed, incredibly realistic, Pixar-quality images. Images that were previously only possible with power-hungry offline or desktop PC rendering can now be rendered in real time on a mobile device without draining the battery or making it too hot to hold.
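The idea of refining a sparse control mesh can be illustrated in two dimensions. The sketch below is not OpenSubdiv code: it applies the cubic B-spline subdivision rule to a closed polygon, which is the curve analogue of the Catmull-Clark scheme used for surfaces, and the function names are made up for the example.

```python
def subdivide_closed(points):
    """One round of cubic B-spline subdivision on a closed 2D polygon."""
    n = len(points)
    refined = []
    for i in range(n):
        prev_pt, cur, nxt = points[i - 1], points[i], points[(i + 1) % n]
        # Vertex point: weights 1/8, 6/8, 1/8 over the neighbours.
        refined.append(
            tuple((a + 6 * b + c) / 8 for a, b, c in zip(prev_pt, cur, nxt))
        )
        # Edge point: midpoint of the edge to the next vertex.
        refined.append(tuple((b + c) / 2 for b, c in zip(cur, nxt)))
    return refined


def refine(points, levels):
    """Apply several rounds; the point count doubles each time."""
    for _ in range(levels):
        points = subdivide_closed(points)
    return points
```

Starting from a four-point square, three rounds give 32 points that approximate a smooth closed curve. Every new point depends only on a few neighbours, which is why this kind of refinement maps so naturally onto a GPU.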

Proprietary concepts

As you can imagine dozens of semiconductor companies are searching for methods to reduce power while improving performance. Generally they adhere to industry standards, and/or submit their ideas for adoption by the industry so all applications can run on all machines. However, once you get beyond the application and closer to the screen, a semiconductor manufacturer can take more liberty.

Qualcomm’s Assertive Display

Mobile devices, by their very definition, are used in a wide variety of lighting conditions. One of the most challenging is direct sunlight. The default response is to simply crank up the backlight, but this is both power hungry and naïve – you can't compete with the sun. Qualcomm came up with a novel approach for their Snapdragon 800. It contains dedicated hardware and software that enables a per-pixel dynamic range adjustment algorithm to effectively increase the contrast between the adjacent objects that are being displayed, based on sensor input that indicates the ambient lighting conditions.

Figure 5: Qualcomm's Assertive Display

It saves power and makes it much easier to use your mobile device in bright sunlight. At recent industry events, Qualcomm has shown a simple, side-by-side demo on Snapdragon 800 devices with this Assertive Display feature turned on vs. another with the backlight cranked all the way up, and the difference is dramatic. This feature will first come to market in the new Amazon Kindle Fire HDX that uses Qualcomm’s Snapdragon 800 mobile processor.

Trusight’s Neuromorphic curves luminance to match the eye

Trusight, a Silicon Valley startup, introduced a technology based on years of research around how the eye sends light information to the brain. The Trusight implementation sits in the RGB or YUV buffer, anywhere in the capture, distribution or display workflow, and effectively performs impedance matching of the visual content to how the eye sends light information to the brain. This lightweight pixel shader implementation can sit in firmware on a device, where after utilizing a histogram, Trusight executes just 15 instructions on all pixels to restore the dynamic luminance values sensed by the imager. Avoiding the complexity of tone mapping, the process eliminates the normal levels of backlight power required to view an image and obtains up to a 50% improvement in battery life.

Figure 6: Trusight can bring images out of the dark

Trusight's Vice President of Engineering, Randall Eike, is obscured by shadows in the original image (top) but revealed in the Trusight-processed version, without washing out his family's facial features and other details in the process (bottom).

Trusight’s neuromorphic implementation has the additional benefit of improving content compressed for transmission, delivering more perceivable information with fewer bits. The results are quite dramatic even when viewing HDR content, as this pre-processing step also aids intelligibility whenever one is viewing in direct sunlight.

Summary

A revolution is happening in the world of mobile graphics for smartphones and tablets. Today’s mobile processors and the ecosystem that supports them are improving at an astonishing rate – and consequently delivering visually stunning user experiences in less than 3 Watts of power consumption.

The main tricks are to turn things off and to reuse data as much as possible so you don’t have to pay for it again and again.

I hope this discussion was informative and interesting. I’d also like to hear from you. What do you think is an interesting, or scary trend, and what would you like to see?
