Performance at scale: Sequencer in Unreal Engine 4.26

Epic Games Senior Programmer Andrew Rodham
As the complexity, scale, and fidelity of real-time cinematic content continue to push the envelope of quality in Unreal Engine, we can critically assess the runtime capabilities of Unreal Engine’s cinematic tool, Sequencer, and identify areas of optimization potential.

With UE 4.26, Sequencer has been heavily optimized through a reworking of its internal architecture to enable much higher performance for large-scale cinematics and concurrent UI animations than previously possible. In this tech blog, we explore how re-organizing Sequencer’s data structures and runtime logic using data-oriented design principles provides a host of benefits that include highly improved optimization potential when dealing with large data sets and enabling greater third-party extensibility. Moving forward, these optimizations pave the way for more interactive and dynamic content authored through Sequencer.
 

What’s in a Sequence?

For those that don’t know, Sequencer is Unreal Engine’s non-linear editing tool, which allows keyframed animation of cut-scenes, in-game events, UMG widgets, offline rendered movies, and more. It principally comprises three elements: Object Bindings, Tracks, and Sections.
 
Figure 1--Element locations in Sequencer
 
Sections generally contain keyframe data and other properties that define the desired states for each transform, property, or animation along its timeline. When playing a sequence in-game, the engine’s job is to take all these tracks and sections and apply the correct states and properties to the desired objects resolved from the game world.

Prior to 4.26, each of these tracks and sections had its own runtime instance that contained both the data and logic required to evaluate at a specific time and apply the properties to resolved objects. These instances were designed using traditional object-oriented paradigms and were accessed through a virtual API. While this approach yields acceptable performance with small numbers of tracks, it scales poorly as the number of tracks and their relative complexity increase. Furthermore, running larger numbers of animations or sequences in parallel would accrue high-level performance costs from initializing the pipeline for each instance.

With access to CPU profile data from a wide range of use cases, it became apparent that optimization efforts yielded diminishing returns at scale due to the inherent overhead of virtual function calls and poor cache locality. Additionally, tracks such as the Transform track were becoming burdened with an increasing number of responsibilities such as blending, partially animated transforms, relative transform origins, attachments, component velocity, and more.

It was clear that a more systematic redesign was required in order to achieve the optimizations necessary to run content like this in real-time alongside all other game systems on console and/or mobile hardware.
 

Designing for Speed

When looking at a new foundation for the Sequencer runtime, we focused on the following design goals:

Scalability
With the new runtime, it should be possible to author content with hundreds of tracks or sequences and optimize evaluation logic for this content as a whole. This includes:
  • Allocating and organizing data so that performance does not degrade rapidly with the number of active sequencer tracks.
  • Ability to completely remove logic and branches that are known to not be necessary, rather than paying that cost every frame. No code is the fastest code. This should be applicable to all aspects of Sequencer evaluation code, not just the highest level.
  • Evaluation logic should have direct, efficient, and unhindered access to the necessary data in batches without having to interact with memory through complex or inefficient abstractions.

Concurrency
Writing evaluation logic should be simple, and that logic should expand naturally to multiple cores in a safe and efficient manner, including the expressive definition of upstream/downstream dependencies (for example, evaluating all curves in parallel). This benefits very large sequences as well as many small, lightweight animations, since the pipeline only needs to be set up once.

Extensibility
Building logic on top of built-in functionality should be possible without re-implementing core systems. Adding upstream or downstream features that interact with core functionality should be reasonably attainable. This includes:
  • All data for the current frame should be transparent and mutable at any point in the pipeline.
  • Reliable dependency management.
These design goals, along with Sequencer’s problem domain, naturally lend themselves to data-oriented design principles for several reasons:
  • The majority of Sequencer data is homogeneous and can be laid out sequentially.
  • Logic for each data transformation is generally very self-contained and context-free (e.g., running f(x) for a curve).
  • The control and data flow are very linear, with no cyclic or recursive dependencies (i.e., there is no logic that requires the re-computation of already computed values).
  • Only the initial setup and final property setters have threading restrictions.

In 4.26, the foundational runtime for Sequencer was redesigned to lay out evaluation data in a way that is cache-efficient and transparent through an evaluation framework based on the entity-component-system pattern. Track instances (internally referred to as templates) are no longer required for evaluation (though the legacy runtime still executes alongside the new system). Instead, the data and logic for each type of track are decoupled: entities now represent the component parts that relate to the source track data, and systems represent a single logical data transformation that relates to a specific component or combination of components. Systems now operate on all data that matches their query in batches. This enables cache-efficient application of, for instance, all transform properties, or evaluation of all float channels with just a single virtual function call and very few cache misses regardless of how many transform properties are being animated.
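To make the batching idea concrete, here is a minimal standalone sketch. It is not the engine's code: LinearChannel, FloatChannelBatch, System, and EvaluateFloatChannels are hypothetical names, and a single linear segment stands in for a real keyframed curve. The point is only that one virtual call processes every channel in the batch, whatever property each channel ends up animating.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a keyframed float channel: one linear segment.
struct LinearChannel {
    float StartTime, StartValue;
    float EndTime, EndValue;

    float Evaluate(float Time) const {
        if (Time <= StartTime) return StartValue;
        if (Time >= EndTime) return EndValue;
        const float Alpha = (Time - StartTime) / (EndTime - StartTime);
        return StartValue + Alpha * (EndValue - StartValue);
    }
};

// Component data stored as parallel arrays, one slot per entity.
struct FloatChannelBatch {
    std::vector<LinearChannel> Channels;
    std::vector<float> EvalTimes;
    std::vector<float> Results;
};

// A system is a single data transformation; it is invoked once per frame,
// not once per track.
struct System {
    virtual ~System() = default;
    virtual void Run(FloatChannelBatch& Batch) = 0;
};

struct EvaluateFloatChannels : System {
    void Run(FloatChannelBatch& Batch) override {
        assert(Batch.Channels.size() == Batch.EvalTimes.size());
        Batch.Results.resize(Batch.Channels.size());
        // Tight loop over contiguous data: no per-entity dynamic dispatch.
        for (std::size_t i = 0; i < Batch.Channels.size(); ++i) {
            Batch.Results[i] = Batch.Channels[i].Evaluate(Batch.EvalTimes[i]);
        }
    }
};
```

Calling Run once on a batch of two channels or two thousand costs the same single dynamic dispatch; only the length of the inner loop changes.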

The result is a significant reduction in overhead for Sequencer’s evaluation of large content for migrated track types from 4.26 onward. 

Additionally, this approach also enables us to coalesce the evaluation data for many sequences or Widget Animations together, enabling greater optimization potential across the entire set of running animations. This allows more Widget Animations to run together than previously possible, and also enables multiple separate animations or sequences to be able to blend together if necessary.

Pipeline Example 
Let’s take a look at an example pipeline for evaluating all transform tracks. A transform track in Sequencer comprises between one and nine keyframe channels for its composite floats - Location X/Y/Z, Rotation Roll/Pitch/Yaw, and Scale X/Y/Z. In order to apply the transform to an object, we must first evaluate these curves at the current time, then call “USceneComponent::SetRelativeTransform” with the result. Prior to 4.26, the channel data was copied to a template object which contained up to nine channels and the logic for evaluating these channels and applying them to an object:
Figure 2--Transform Track Evaluation prior to 4.26
 
In this design, the number of virtual function calls (dynamic dispatch) scales linearly with the number of tracks, and with it the potential for cache misses.

In the new framework, when this track is resolved against its object, a transform entity is created with the following components: pointers to the composite float channels; floats for the evaluated result of each channel; a function pointer to “USceneComponent::SetRelativeTransform”; and a tag signifying that the entity is a transform property.
Figure 3--New Transform Track Evaluation from 4.26 onward
The new approach is significantly improved for several reasons:
  • It offers better separation of concerns and better data and instruction cache locality, and is demonstrably faster at scale;
  • The number of virtual function calls scales with the number of active system types rather than the number of individual tracks;
  • The number of cache misses is proportional to the number of combinations of component data, rather than the size of the data as a whole;
  • These gains can be made across the full data set, rather than optimizations being constrained to each track; and
  • Parts of the pipeline can now be made concurrent with no meaningful change to the logic.
For example, the ‘Evaluate Float Channels’ system in the above example is able to evaluate all float channels for the current frame in parallel, regardless of what properties they animate or how they are to be combined or applied.

Furthermore, introducing intermediate data transformations, such as blending multiple transforms together into a single object, becomes simpler and more compartmentalized.

Before 4.26, this functionality was implemented in a dedicated class that stored and processed all inputs/outputs for all property types. This introduced a storage and CPU overhead due to the cost of small-allocation optimization and dynamic dispatch. It also required expressing the application of properties in a way that understood the blending code path, which further complicated and degraded the common (no blending) codepath.
Figure 4--Transform Track Evaluation blending prior to 4.26
The new pipeline allows this blending code to live in its own system, with inputs and outputs being assigned a unique ID. Not only is the operation faster due to the parallelism and efficiency gains of the data model, but when no blending is required, the system does not exist at all, which results in zero runtime overhead. 
Figure 5--Transform Track Evaluation blending from 4.26 onward

Memory Layout

Let’s dig a little deeper into what evaluation data looks like in memory for Sequencer. We prototyped various approaches to data layouts and settled on the following data model as it gives a good compromise between memory size, cache efficiency, reallocation cost, and concurrent access. 

Entities are grouped in batches referred to as entity allocations, with each allocation containing at least one entity. Crucially, all entities within an allocation have exactly the same number and types of components. When a component is added to or removed from a specific entity, that entity is migrated to a different (or new) allocation. Allocations are dynamically sized with reserved capacity, allowing new entities to be added without a call to malloc (all that is required is the initialization of the component data).
Figure 6--FEntityAllocation and FComponentHeader Allocation examples
For example, an allocation with Float Result 0, Float Channel 0, and Eval Time components is shown below with example data and the logic relating to evaluating these channels.
Figure 7--Example data from Entity Allocation
As you can see, all components of the same type are laid out sequentially within an allocation. For an allocation containing Components A, B, and C, this results in a layout of AAA..A, BBB..B, CCC..C, with each type being aligned to a cache line. Organizing data in this way allows us to control the location of entity data without impacting its entity ID and also allows us to control read or write access to each component array (for now, this is just by way of a mutex, but this may be expanded to a specialized scheduler in future if contention becomes a problem) while maintaining good packing between the types. If a system wants to read all component data for Components A and B, it only needs to check the type of the allocation (determined by a bit mask). From this, the system knows immediately whether the (potentially thousands of) entities within match the types it’s interested in. 
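The type check described above can be sketched in a few lines. This is an illustrative model only: the ComponentBit values, the Allocation struct, and ForEachMatching are assumptions made for the example, and the real allocation also carries the component arrays and locking described in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical component type bits; the engine uses its own type IDs.
enum ComponentBit : std::uint32_t {
    EvalTime     = 1u << 0,
    FloatChannel = 1u << 1,
    FloatResult  = 1u << 2,
    TransformTag = 1u << 3,
};

// Every entity in an allocation has exactly the component set in Mask,
// so one mask test answers the question for all of them at once.
struct Allocation {
    std::uint32_t Mask = 0;
    std::size_t NumEntities = 0;
};

inline bool Matches(const Allocation& A, std::uint32_t Required) {
    return (A.Mask & Required) == Required;
}

// Calls Visit on each allocation that has all Required components and
// returns the total number of entities those allocations contain.
template <typename Fn>
std::size_t ForEachMatching(std::vector<Allocation>& Allocations,
                            std::uint32_t Required, Fn&& Visit) {
    std::size_t Visited = 0;
    for (Allocation& A : Allocations) {
        if (Matches(A, Required)) {
            Visit(A);
            Visited += A.NumEntities;
        }
    }
    return Visited;
}
```

A system interested in EvalTime and FloatResult would pass `EvalTime | FloatResult` as Required. Allocations with extra components (a transform tag, for instance) still match, which is what lets systems ignore the parts of an entity they do not care about.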

This pattern-matching approach completely decouples data transformation logic from the data’s composition, allowing systems to operate on many different combinations of component data in parallel without knowledge of the unrelated parts, whilst still maintaining performance at scale. To expand upon the example above, inserting a system that is able to make all transform tracks relative to a transform-origin is trivial and does not require intrusive changes to any of the core transform systems.
Figure 8--Transform Track Evaluation with Transform Origins from 4.26 onward

Update Phases

Sequencer systems are able to run within any number of different phases that are run depending on context, each with their own designations and restrictions. These boundaries help to enforce some of the stricter ordering requirements found in Sequencer, while still making asynchronous evaluation logic possible for the majority of systems that run every frame. These are loosely divided into two sections: systems that only run when a boundary has been crossed or bindings have been invalidated (Spawn and Instantiation phases), and systems that run every frame (Evaluation and Finalization). 
  • Spawn: Specifically houses spawnable logic and events that must occur before or after bindings are resolved. Cannot dispatch async tasks.
  • Instantiation: Contains any system that wishes to create/destroy entities, or add/remove components and tags in Linker->EntityManager. This phase is only run when the Entity Manager structure has changed, or object bindings have been invalidated. Cannot dispatch async tasks.
  • Evaluation: Systems that need to run every frame to produce an evaluated state. Cannot mutate the structure of the entity manager (i.e., add or remove entities or components). The majority of systems will run here.
  • Finalization: Runs right at the end of the frame.

The distinction of these phases allows systems to execute more expensive setup/teardown logic only when absolutely necessary while keeping 90% of the common codepath as lean as possible.

The spawn and instantiation phases are only run when there has been a structural change in the entity manager, such as when a new section is being evaluated, a section is no longer evaluated, or when bindings are invalidated. These phases generally perform expensive tasks like resolving properties, mutating entity structure, or caching pre-animated values. Inside these phases, the entity manager can be changed and mutated, whereas during the Evaluation and Finalization phases, the Entity Manager is locked down and cannot structurally change (even though component data can still be written to or read from).

This is a typical sequence flow for running a Sequencer evaluation. As you can see, when boundaries are not crossed in any sequence, only the Evaluation and Finalization phases are executed.
Figure 9--Sequencer Evaluation System Phase Update
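The phase gating can be modeled with a small sketch. Everything here is hypothetical (the Phase enum, the EntityManager struct, and RunFrame are illustrative names, and the phase order simply follows the list above); it shows the two rules the text describes: the structural phases run only when there is structural work to do, and the entity manager rejects structural changes once evaluation begins.

```cpp
#include <cassert>
#include <vector>

enum class Phase { Spawn, Instantiation, Evaluation, Finalization };

// Toy entity manager: structural changes are refused while locked.
struct EntityManager {
    bool bLocked = false;
    int NumEntities = 0;

    bool TryAddEntity() {
        if (bLocked) return false;
        ++NumEntities;
        return true;
    }
};

// Runs one frame and returns the phases that executed, in order.
// NumPendingEntities > 0 stands in for "a boundary was crossed or
// bindings were invalidated".
inline std::vector<Phase> RunFrame(EntityManager& Manager, int NumPendingEntities) {
    std::vector<Phase> Ran;
    if (NumPendingEntities > 0) {
        Manager.bLocked = false;            // structure may change here
        Ran.push_back(Phase::Spawn);
        Ran.push_back(Phase::Instantiation);
        for (int i = 0; i < NumPendingEntities; ++i) {
            const bool bAdded = Manager.TryAddEntity();
            assert(bAdded);
            (void)bAdded;
        }
    }
    Manager.bLocked = true;                 // structure is frozen from here on
    Ran.push_back(Phase::Evaluation);
    Ran.push_back(Phase::Finalization);
    return Ran;
}
```

On a steady-state frame (nothing pending) only two phases run, which is the lean common path the phase split is designed to protect.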

Threading

Since we are now able to define logic in terms of the types of components that it reads from and writes to, it is now also possible to define upstream and downstream dependencies to component data, which allows us to safely and automatically dispatch these data transformations to the Task Graph asynchronously where it makes sense. When evaluating a large data set such as content with many complex keyframe curves, the evaluation of these curves is now able to run asynchronously alongside other computations, allowing for concurrency as the platform allows. This is only beneficial where the potential cost of the context-switch or thread preemption is justifiable, but Sequencer’s thread-safe design now allows that decision to be made internally or at user-discretion.
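One way to picture how read/write declarations turn into scheduling decisions is the sketch below. It is an assumption-laden illustration rather than a description of the Task Graph integration: SystemAccess, IsUpstreamOf, CanRunConcurrently, and the bit constants are all hypothetical names. Each system declares which component types it reads and writes, and ordering and concurrency fall out of comparing those sets.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical component type bits.
constexpr std::uint32_t EvalTimeBit    = 1u << 0;
constexpr std::uint32_t FloatChanBit   = 1u << 1;
constexpr std::uint32_t FloatResultBit = 1u << 2;
constexpr std::uint32_t OpacityBit     = 1u << 3;

// The component types a system reads from and writes to.
struct SystemAccess {
    std::uint32_t Reads = 0;
    std::uint32_t Writes = 0;
};

// Producer must finish before Consumer starts if Consumer reads
// anything that Producer writes.
inline bool IsUpstreamOf(const SystemAccess& Producer, const SystemAccess& Consumer) {
    return (Producer.Writes & Consumer.Reads) != 0;
}

// Two systems may overlap in time only if neither writes data the
// other touches. Shared reads are always safe.
inline bool CanRunConcurrently(const SystemAccess& A, const SystemAccess& B) {
    const bool bWriteWrite = (A.Writes & B.Writes) != 0;
    const bool bReadWrite  = (A.Writes & B.Reads) != 0 || (B.Writes & A.Reads) != 0;
    return !bWriteWrite && !bReadWrite;
}
```

For example, a system that reads EvalTimeBit and FloatChanBit and writes FloatResultBit is upstream of any system that reads FloatResultBit, while two systems that both only read EvalTimeBit and write different outputs could be dispatched side by side.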

A more comprehensive exposition of the architecture for programmers working with Sequencer will be detailed in due course.
 

Behavior Changes from Older Runtime Systems and 4.26 Onward

In general, you should not experience any change in behavior; spawnables, attachments, transforms, and other properties should all continue to operate as before. However, there are some subtle changes to behavior as outlined below:
  • Active Sequences of a given type are now all evaluated together at the same time. Prior to 4.26, if two separate active sequences animated the same property on the same object, they would fight over control for the property due to being evaluated separately. In 4.26, these sequences will evaluate together, allowing them to blend naturally together. In the future, this will be expanded to allow control over how sequences blend or override each other.
    • All active Widget Animations are now evaluated together.
    • All active Level Sequences are now evaluated together.
  • Manually playing, stopping, or setting positions for sequences or widget animations may no longer re-evaluate the sequence synchronously inside that function call. Instead, the request to play or move the sequence playhead might be deferred until the next frame.
    • This can be disabled for specific sequences at a performance cost by disabling the Async Evaluation option on the sequence. Doing so will force synchronous re-evaluations, and disable the blending behavior described in the first item of this list.
Figure 10--Async Evaluation sequence check 
  • Absolute Blend tracks with easing curves now blend from the object’s initial value. Prior to 4.26, blend weights were always normalized, even when there was only a single contributor. This meant that relative-to-absolute blends required two sections.
     
    In 4.26, a relative-to-absolute blend can be expressed simply as an absolute section with an ease in/out. In Figure 11 below, you can see an example of absolute blend with the ease in/out. 

 
Figure 11--Example of Absolute Blend
 
  • Event tracks that cause reentrancy with respect to sequence evaluation (such as events that play other sequences, change playback state, or move the playhead) are now restricted to the Post Evaluation position (i.e., they are no longer allowed in the PreSpawn or PostSpawn positions). Prior to 4.26, event tracks defaulted to PostSpawn. Existing tracks will keep their PostSpawn setting, while new tracks are set to PostEval by default.
Figure 12--Changing event track position to At End of Evaluation
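The Absolute Blend change in the list above is easiest to see with numbers. The sketch below covers only the single-contributor case and is an interpretation of the description, not the engine's blending code; both function names are hypothetical. With normalized weights, a lone section's eased weight cancels out, so the property jumps straight to the section's value. Blending from the initial value instead lets the easing curve carry the property there gradually.

```cpp
#include <cassert>

// Pre-4.26 behavior for a single absolute contributor (as described):
// the weight is normalized away, so any non-zero weight yields Target.
inline float BlendNormalizedSingle(float Initial, float Target, float Weight) {
    if (Weight <= 0.f) return Initial;
    return (Target * Weight) / Weight;
}

// 4.26 behavior for a single absolute contributor (assumed form):
// interpolate from the object's initial value toward the section value.
inline float BlendFromInitial(float Initial, float Target, float Weight) {
    assert(Weight >= 0.f && Weight <= 1.f);
    return Initial + Weight * (Target - Initial);
}
```

With an initial value of 20, a section value of 100, and an ease-in weight of 0.25, the normalized form returns 100 immediately, whereas the blend from the initial value returns 40 and reaches 100 only when the weight reaches 1.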

Breaking API Changes

UE::MovieScene Namespace
The entirety of the new Sequencer codebase is contained within the UE::MovieScene namespace to reduce the need for excessive class name prefixing. As part of this work, any code contained within the existing MovieScene namespace (which was not heavily used) was moved to UE::MovieScene to ensure consistency and adherence to current Coding Standards.

In the unlikely event that third-party code references the defunct MovieScene namespace, types or functions will need to be changed to UE::MovieScene.

UMovieSceneTrack
If you have custom track implementations, note that the APIs relating to compilation of track templates have migrated to a separate interface (IMovieSceneTrackTemplateProducer) in order to better separate the different methods for evaluation. To define functions such as CreateTemplateForSection, CustomCompile, and PostCompile, add this interface to your UMovieSceneTrack type, since these functions no longer exist on the base class.

Track/Segment Blenders
GetRowSegmentBlender and GetTrackSegmentBlender are now defunct and no longer called; track evaluation fields are now re-generated on modification and saved into the asset, so they don’t need recalculating at compile-time.

UMovieSceneTrack::PopulateEvaluationTree is the new method for defining custom overlap behavior, with built-in algorithms available in FEvaluationTreePopulationRules.

UMovieSceneSection
UMovieSceneSection::GenerateTemplate has been removed in favor of defining templates through IMovieSceneTrackTemplateProducer::CreateTemplateForSection.

ChannelProxy
In the rare case that third-party section types have dynamic channel layouts (i.e., ChannelProxy is re-created outside of the constructor), its construction should be defined in the new function virtual EMovieSceneChannelProxyType CacheChannelProxy().

Summary

While only spawnables, attachments, events, 2D/3D transforms, and float properties have been ported to the new runtime system, we’re already seeing significant speedups for content that makes heavy use of these features. Applying data-oriented design principles to Sequencer’s specific problem domain has not only created the scope for optimization improvements across the board, but has also enabled better separation of concerns within the runtime. It will become the foundation upon which new features such as keyframe/track/sub-sequence parameterization, global blending, and better debugging tools can be built while maintaining high performance.

The following example shows 500 separate sports car actors with their own hand-keyed animation running side-by-side in 4.25 and 4.26. The overhead of calling SetRelativeTransform manually was measured at around 2.5ms, which shows an improvement in Sequencer overhead of roughly 7.5x.
 
Figure 13--Demonstration of performance differences in the SetRelativeTransform costs
 
We hope you found this information to be both interesting and useful. To learn more about Sequencer, check out the tool’s documentation page.