Game engines and shader stuttering: Unreal Engine's solution to the problem
Hello everyone, we are Kenzo ter Elst, Daniele Vettorel, Allan Bentham, and Mihnea Balta, some of the engineers who worked on Unreal Engine’s PSO precaching system.
Recently, there have been a number of conversations taking place in the Epic community around shader stuttering and its impact on game developer projects.
Today, we’re going to dive into why the phenomenon occurs, explain how PSO precaching can help solve the issue, and explore some development best practices that will help you minimize shader stuttering. We’ll also share our future plans for the PSO precaching system with you.
If you’re interested in learning more, don’t miss our Inside Unreal livestream this Thursday, February 6 on Twitch or YouTube at 2 PM ET.
Shader compilation stuttering happens when a render engine discovers that it needs to compile a new shader right before it uses it for drawing something, so everything stops while waiting for the driver to finish compilation. To understand why this can happen we need to take a closer look at how shaders are translated into code that runs on the GPU.
Shaders are programs which execute on the GPU to perform the various steps involved in rendering 3D images: transformation, deformation, shadowing, lighting, post-processing, and so on. They are usually written in a high-level language such as HLSL, which must be compiled to machine code that the GPU can run. This process is similar for CPUs, where code written in a high-level language like C++ is fed into a compiler in order to produce instructions for a certain architecture: x64, ARM, etc.
However, there is a key difference: each platform (PC, Mac, Android, etc.) usually targets one or two CPU instruction sets, but many different GPUs, with wildly different instruction sets. An executable compiled 10 years ago for x64 PCs will run on chips produced today by AMD and Intel because both vendors use the same instruction set and because of the strong backwards compatibility guarantees they provide. In contrast, a GPU binary compiled for AMD will not work on NVIDIA or vice-versa, and instruction sets can even change between different hardware generations from the same vendor.
Therefore, while it’s feasible to compile CPU programs directly to executable machine code and distribute that, a different approach must be used for GPU programs. High-level shader code is compiled to an intermediate representation, or bytecode, which uses an abstract instruction set defined by the 3D API: DXBC for Direct3D 11, DXIL for Direct3D 12, SPIR-V for Vulkan, and so on.
Games ship these bytecode binary files so they have a single shader library, instead of one for each possible GPU architecture. At runtime, the driver translates the bytecode into executable code for the GPU that is installed in the machine. This approach is sometimes used for CPU programs too—for example, Java source code is compiled to bytecode so that the same binary can run on all platforms which have a Java environment, regardless of their CPU.
When this system was introduced, games had relatively simple and few shaders, and the transformation from bytecode to executable code was straightforward, so the cost of doing this at runtime was negligible. As GPUs became more powerful, we started having more and more shader code, and drivers also started to do sophisticated transformations to produce more efficient machine code, which meant runtime compilation cost became a problem. Things reached a breaking point in Direct3D 11, so modern APIs such as Direct3D 12 and Vulkan set out to fix this by introducing the concept of Pipeline State Objects (PSOs).
Rendering an object usually involves several shaders (e.g. a vertex shader and a pixel shader working together), as well as a number of other settings for the GPU: culling mode, blend mode, depth and stencil comparison modes etc. Together, these items describe the configuration, or state, of the GPU pipeline.
Older graphics APIs like Direct3D 11 and OpenGL allow for changing parts of the state individually and at arbitrary times, which means that the driver only sees the complete configuration when the game issues a draw request. Some settings influence the executable shader code, so there are cases when the driver can only start compiling shaders when the draw command is processed. This can take tens of milliseconds or more for a single draw command, resulting in very long frames when a shader is used for the first time—a phenomenon known to most gamers as hitching or stuttering.
Modern APIs require developers to package all the shaders and settings they will use for a draw request into a Pipeline State Object and set it as a single unit. Crucially, PSOs can be constructed at any time, so in theory engines can create everything they need sufficiently early (for example during loading), so that compilation has time to finish before rendering.
Unreal Engine has a powerful material authoring system which is used by artists to create visually rich and compelling worlds, and many games contain thousands of materials. Each material can produce many different shaders: for example, there are separate vertex shaders for rendering one material on static vs. skinned vs. spline meshes. The same vertex shader can be used with several pixel shaders, and this is again multiplied by different sets of pipeline settings. The result can be millions of different PSOs that would have to be compiled upfront to ensure that all possibilities are covered, which is of course infeasible for both time and memory reasons (loading a level would take hours).
A very small subset of these possible PSOs is used at runtime, but we can’t determine what that subset is by only looking at a material in isolation. The subset can also change between different game sessions: modifying video settings toggles certain rendering features, which makes the engine use different shaders or pipeline states. Early Direct3D 12 engine implementations relied on playtests, automated level fly-throughs, and other such discovery methods to record which PSOs are encountered in practice. This data was included with the final game and used to create the known PSOs at startup or level load time. Unreal Engine calls this a “Bundled PSO Cache”, and it was our recommended best practice until UE 5.2.
The bundled cache is sufficient for some games, but it has many limitations. Gathering it is resource-intensive and it must be kept up to date when content changes. The recording process might not be able to discover all the PSOs in games with very dynamic worlds: for example, if objects change materials based on player actions.
The cache can become much larger than what’s needed during a play session if there’s a lot of variation between sessions, e.g. if there are many maps, or if players can choose one skin out of many. Fortnite is a good example where the bundled cache is a poor fit, as it runs into all these limitations. Moreover, it has user-generated content, so it would have to use a per-experience PSO cache, and place the onus of gathering these caches on the content creators.
In order to support large, varied game worlds and user-generated content, Unreal Engine 5.2 introduced PSO precaching, a technique to determine potential PSOs at load time. When an object is loaded, the system examines its materials and uses information from the mesh (e.g. static vs. animated) as well as global state (e.g. video quality settings) to compute a subset of possible PSOs which may be used to render the object.
This subset is still larger than what ends up being used, but much smaller than the full range of possibilities, so it becomes feasible to compile it during loading. For example, Fortnite Battle Royale compiles about 30,000 PSOs for a match and uses about 10,000 of them, but that’s a very small portion of the total combination space, which contains millions.
Objects created during map loading precache their PSOs while the loading screen is displayed. Those that stream in or spawn during gameplay can either wait for their PSOs to be ready before being rendered, or use a default material which was already compiled. In most cases, this only delays streaming for a few frames, which is not noticeable. This system has eliminated PSO compilation stuttering for materials and works seamlessly with user-generated content.
Changing the material on an already visible mesh is a harder case, because we don’t want to hide it or render it with a default material while the new PSO is being compiled. We are working on an API which enables game code and Blueprints to hint the system ahead of time, so that the extra PSOs can be precached as well. We also want to change the engine to keep rendering the previous material while the new one is compiling.
Unreal Engine has a separate class of shaders which are not related to materials. These are called global shaders, and they are programs used by the renderer to implement various algorithms and effects, such as motion blur, upscaling, and denoising. The precaching mechanism covers global compute shaders too, but as of UE 5.5 it doesn’t handle global graphics shaders. These types of PSOs can still cause rare one-time hitches when they are first used. There’s ongoing work to close this remaining gap in precaching coverage.
The bundled cache can be used in conjunction with precaching, and this can provide benefits for certain games. Some common materials can be included in the bundled cache, so that they are compiled on startup, instead of during gameplay. It can also help with global graphics shaders, since the discovery process will encounter and record them.
Drivers save compiled PSOs to disk, so they can be loaded directly when they are encountered again during subsequent game sessions. This helps games regardless of which engine and PSO compilation strategy they use. For Unreal Engine titles using PSO precaching, it means the loading screen will be noticeably shorter on the second run. Fortnite takes around 20-30 seconds longer to load into a Battle Royale match when the driver cache is empty. The cache is cleared when a new driver is installed, so it’s normal to see longer loading screens the first time a game runs after a driver update.
Unreal Engine takes advantage of the driver cache by creating PSOs during loading and immediately discarding them when they finish compiling—this is why the technique is called precaching. When a PSO is later required for rendering, the engine issues a compilation request, but the driver just returns it from the cache, because the precaching system made sure it’s there. Once a PSO is used for drawing, it will remain loaded until all the primitives using it are removed from the scene, so we don’t keep asking the driver for it every frame.
Discarding after precaching has the advantage that unused PSOs are not kept in memory. The downside is that fetching a PSO from the driver cache right when it’s needed can still take some time, and even though it’s much faster than compiling it, this can lead to micro-stutters the first time a material is rendered.
One simple solution is to keep precached PSOs instead of discarding them, but this can increase memory usage by more than 1 GB, so it should only be done on machines that have enough RAM. We are working on solutions for reducing the memory impact and automatically deciding when precached PSOs can be kept alive.
Only some of the states affect the executable PSO code. This means that when we create two PSOs which have the same shaders, but differ in pipeline settings, it’s possible that only the first one goes through the expensive compilation process, and the second is returned immediately from the cache.
Unfortunately, the set of states that matter for code generation is different between GPUs, and can even change from one driver version to another. Unreal Engine takes advantage of some practical knowledge which enables us to skip some permutations during the precaching process. Redundant requests are shorter thanks to the driver cache, but the engine still has to do work to generate them. This work adds up, so the pruning process is useful in reducing loading times, as well as memory usage.
Mobile platforms use the same on-device shader compilation model, and Unreal Engine’s precaching system is effective there as well. In general, the mobile renderer uses fewer shaders than on desktop, but PSO compilation takes much longer due to CPUs being slower, so we’ve had to make some tweaks to the process to make it feasible.
We skip some rarely used permutations, which means the precache set is no longer conservative, so in some cases there can be hitches if one of the uncommon states ends up being rendered. We also have a timeout for precaching during map loading to prevent showing the loading screen for an excessive amount of time. This means the game can start while there are still outstanding compilation tasks, so we will stutter if one of the in-flight PSOs is needed right away. We use a priority boost system to move tasks to the front of the queue when a PSO is required, to minimize these hitches.
Consoles do not need to solve this problem because they have a single target GPU. Individual shaders are compiled directly to executable code and shipped with the game. There is no combinatorial explosion from using the same vertex shader with multiple pixel shaders, or due to pipeline states, because these factors do not cause recompilation. Shaders and states can be assembled into PSOs at runtime without incurring a significant cost, so there are no PSO hitches on these platforms.
It’s a partial misconception that Direct3D 11 didn’t have these issues, and we occasionally hear calls to go back to the old compilation model or even to old graphics APIs. As explained earlier, hitches happened back then too, and due to the way the API was designed, engines had no way to prevent them. They were less frequent or shorter mostly because games had fewer, simpler shaders and because some features such as ray tracing didn’t exist at all.
Drivers also did a lot of magic to minimize stuttering, but couldn’t avoid it entirely. Direct3D 12 tried to solve the problem before it got worse by introducing PSOs, but engines took a while to use them effectively, partly because of the difficulty of retrofitting existing material systems, and partly due to shortcomings in the API which only became apparent as games grew in complexity.
Unreal Engine is a general-purpose engine with many use cases and lots of existing content and workflows, so the problem was particularly hard to tackle. We are finally reaching a point where we have a viable solution, and there are also good initiatives to address the API shortcomings, such as the graphics pipeline library extension for Vulkan (VK_EXT_graphics_pipeline_library).
The precaching system has evolved a lot since its experimental introduction in 5.2 and it prevents most kinds of shader compilation stutters. However, it still has some coverage gaps and other limitations, so there are ongoing efforts to improve it further. We’re also working with hardware and software vendors to adapt drivers and graphics APIs to the way games use these systems in practice.
Our ultimate goal is to handle precaching automatically and optimally, so that game developers don’t need to do anything to prevent hitching. Until the system is finished, there are still some things which licensees can do to ensure smooth gameplay:
Remember, for more on this topic, you can join us on the Inside Unreal stream this Thursday, February 6 on Twitch or YouTube at 2 PM ET.