April 21, 2015

Optimizing TArray Usage for Performance

By Zak Middleton

In Unreal Engine, TArray is a dynamically sized array of typed elements. TArrays are very convenient to the programmer, and they are used *a lot* in our codebase. However there can be some subtle performance issues that can arise, and for optimal performance, you need to make sure you understand what goes on behind the scenes.

Making the smart choice: is TArray right for you?

As always when talking about performance, you should profile your code to make sure you are addressing the right areas of concern. It's also good to have smart up-front practices that avoid potential issues from the start.

For container types (such as TArray, TSet, TMap, etc), often the most important thing to consider is whether you are using the right container given the operations of your code in the critical path. For example, if you need to keep a list of unique elements and add, remove, or search for them often, TArray is usable but TSet is probably a better choice.

If you want blazing-fast element iteration in a compact memory space, then TArray is a great choice. However, you need to look out for the implications of those other operations (add, remove, etc) as you write your code. Here we'll look at a few simple ways to ensure your TArray usage is smart and doesn't take up valuable optimization time when it's time to ship your game!

1. By default TArray grows and reallocates memory as items are added, according to an allocation policy.

Adding elements to TArray is easy, which is part of what makes TArray so convenient. Under the hood, more memory is allocated periodically as more items are added, and each time it does so it must copy the elements to the new space and then free the old memory. This can be expensive, and we'd like to avoid that if possible. There are a few things we can do to avoid performance hits:

1.1: Reserve space up front if you know how many items are going to be added, or have an idea of an upper bound.

For example, adding one line of code to call "Reserve" guarantees at most one allocation within this function:

// Adds N copies of a character to the end of an array.
void AppendCharacterCopies(char CharToCopy, int32 N, TArray<char>& Result)
{
	if (N > 0)
	{
		Result.Reserve(Result.Num() + N);
		for (int32 Index=0; Index < N; Index++)
		{
			Result.Add(CharToCopy);
		}
	}
}

Normally adding elements one-by-one N times can cause a few reallocations and copies along the way as the array periodically expands, but here we ensure that at most one allocation can occur by taking the current number of elements and ensuring that Result can hold at least N additional elements before we start adding them. If there was already enough room, no allocation had to occur at all!

1.2: Take TArrays by reference when used as function parameters.

In the example in 1.1, note the way the function was declared to take a reference of the TArray. Consider an alternate declaration:

// takes array by value!
void AppendCharacterCopies(char Original, int32 N, TArray<char> Result)

By passing by value, the function caller will first make a copy of their array before passing it to AppendCharacterCopies(). This is expensive, as it allocates memory and copies all elements to a new TArray. Not to mention it breaks the intent of the function, which is to affect the caller's copy of the array!

This might seem basic, but that simple omission of the "&" can cost some serious cycles.

2. By default, TArray will shrink and reallocate memory as items are removed, again according to an allocation policy.

TArray stores elements in a compact, linear array. Removing elements from anywhere but the end of the list causes all elements after that location to be shifted in to fill the gap. Additionally, TArray assumes that if you are removing elements you may want the memory back, so occasionally as elements are removed the memory will be reallocated to reduce memory usage. This requires copying the elements to new memory and freeing the old space.

Thankfully there are a few options to mitigate this:

2.1: Reduce copies when removing elements.

One approach:

// initial array:
// 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

// remove element at index 3
Array.RemoveAt(3);

// temporary state as element is being removed
// 0, 1, 2, _, 4, 5, 6, 7, 8, 9
// elements 4,5,6,7,8,9 all shift left
// 0, 1, 2, 4, 5, 6, 7, 8, 9

There's a way to avoid the performance hit of shifting all elements to fill in a gap when an element is removed: RemoveAtSwap(). This removes the element at the requested index, then if possible replaces it with the last element in the array to fill in the hole, rather than shifting down all elements.

Faster approach:

// initial array:
// 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

// (remove element at index 3, allowing it to swap in the last element to replace it)
Array.RemoveAtSwap(3);

// temporary state as element is being removed
// 0, 1, 2, _, 4, 5, 6, 7, 8, 9
// element 9 swaps position to where 3 previously was
// 0, 1, 2, 9, 4, 5, 6, 7, 8

With one simple change we've again improved performance. Remember this works provided that you don't care about the order of elements in the array after removal.

2.2: Control memory cleanup.

Most TArray element removal functions take a "bAllowShrinking" parameter. This basically controls whether the TArray is allowed to free up space when elements are removed. Sometimes this is a good idea so you don't waste memory for arrays that last a long time, but in cases of short-lived arrays this takes valuable time and pollutes the cache.

Let's look at TArray::RemoveAtSwap:

void RemoveAtSwap(int32 Index, int32 Count = 1, bool bAllowShrinking = true)

In this example we remove all elements with a certain criteria (value is even), and tell the array not to change memory allocation size when items are removed.

void RemoveEvenValues(TArray<int32>& Numbers)
{
	for (int32 Index = 0; Index < Numbers.Num(); Index++)
	{
		if (Numbers[Index] % 2 == 0)
		{
			Numbers.RemoveAtSwap(Index, 1, false);
			Index--;   // since we need to process this index again!
		}
	}
	
	// Optional: reclaim empty space after values have been removed
	Numbers.Shrink();
}

TArray already has some built-in functions to do similar work: RemoveAll, RemoveAllSwap, RemoveSingle, RemoveSingleSwap, etc. are all useful tools to be able to remove certain elements from the array, and they already have efficient implementations.

3. TArray's default allocator is a dynamic (heap) memory allocator.

Sometimes this is fine and what you want (especially if arrays can grow quite large), and is part of what makes TArray so convenient. But there are costs to heap allocation: locks, the cost of finding free blocks, and cache pollution since you are now accessing more address space.

However there are options which in some cases can improve performance with little effort. One such option is to use a different allocator than the default. Normally this sounds intimidating, but UE4 makes this quite easy.

NOTE: Non-default allocators, such as TInlineAllocator, are not supported when used with UPROPERTY. Using a non-default allocator will restrict the TArray to only being available in code.

One useful allocator that I'm going to talk about is TInlineAllocator. It is an allocator specially designed for containers (such as TArray) that takes two template parameters:

- the number of inline elements

- a secondary allocator (which defaults to a heap allocator).

The first argument (number of inline elements) is what we're most interested in-- this allocator reserves space in memory immediately where the allocator is instanced. Once that space is exhausted, the allocator moves the elements to overflow space created by the secondary allocator. For TArray, this means an array declared on the stack will reserve space directly on the stack for that many elements. A TArray declared as part of a class or struct will store the inline allocation as part of that class or struct-- no dynamic allocation for the first N elements added as part of that array.

This has several advantages. If you have a good idea of the max number of elements expected in your array in common cases, you can completely avoid any heap allocation, because it all goes in the space created as part of the TArray. Additionally, as elements are added, there isn't the need to periodically grow the array and copy the elements to newly allocated space.

What this means is that we can take a normal TArray like this:

TArray<Shape*> MyShapeArray;

and easily change it to:

TArray<Shape*, TInlineAllocator<16>> MyShapeArray;

In this example you avoid any dynamic allocation for the first 16 elements added to the array, because they fit in the area on the stack that is part of the TInlineAllocator. At 17 elements and beyond, all elements are moved to the secondary allocator (ie heap allocator) for storage.

The size of the TInlineAllocator must be determined at compile time. It's up to you to decide the trade-off of potential speed-ups versus reserving the space up front for *every* instance of your array. It's also worth mentioning TFixedAllocator, which is similar to TInlineAllocator except that it has no secondary backup allocator; if it runs out of space in the fixed memory portion, your code will cause an error.

4. Making your types easier to use

In general a nicer way to declare your array types (which also saves you some typing!) is to use a typedef. This also has the advantage of making it easy to update any hardcoded number of inline elements in one place and have it propagate to all instances. Examples:

// declare type
typedef TArray<Shape*, TInlineAllocator<16>> ShapeArrayType;

// create an instance
ShapeArrayType MyShapeArray;

When iterating TArrays of any type, it's convenient to use the typedef to avoid typing out of the TArray type definition, which also makes changing the type later much easier:

for (ShapeArrayType::TIterator Iter(MyShapeArray); Iter; ++Iter)
{
	Shape* MyShape = *Iter;
	MyShape->Draw();
}

Alternatively you could use the C++11 "auto" keyword and replace the first line with:

for (auto Iter = MyShapeArray.CreateIterator(); Iter; ++Iter)

A nice compact method is to use range-based for loops (refer to this blog post for more info). Just remember when using the "auto" keyword that you typically want to capture the element type as a pointer or reference to avoid undesirable copies:

for (auto* MyShape : MyShapeArray)
{
	MyShape->Draw();
}

5. TArray type conversion

Keep in mind that TArrays (or any container) with different allocators are in fact a different type, so they cannot be automatically converted to another TArray type. For example you cannot pass MyShapeArray from above to a function taking a plain TArray<Shape*>.

int32 GetNumShapes(TArray<Shape*>& ShapeArray)
{
	return ShapeArray.Num();
}

void TestCompile()
{
	TArray<Shape*> HeapShapeArray;
	TArray<Shape*, TInlineAllocator<16>> InlineShapeArray;
	
	FillShapeArray(HeapShapeArray);
	FillShapeArray(InlineShapeArray);
	
	// ok, type is TArray<Shape*, FDefaultAllocator>&
	const int32 A = GetNumShapes(HeapShapeArray);
	
	// *error*: cannot convert from TArray<Shape*, TInlineAllocator<16>>&
	// to TArray<Shape*, FDefaultAllocator>&
	const int32 B = GetNumShapes(InlineShapeArray);
}

However with some simple template code, you can make it work just fine:

template<typename AllocatorType>
int32 GetNumShapes(TArray<Shape*, AllocatorType>& ShapeArray)
{
	return ShapeArray.Num();
}

void TestCompile()
{
	TArray<Shape*> HeapShapeArray;
	TArray<Shape*, TInlineAllocator<16>> InlineShapeArray;
	
	FillShapeArray(HeapShapeArray);
	FillShapeArray(InlineShapeArray);
	
	// ok, compiler infers template type of allocator as FDefaultAllocator
	const int32 A = GetNumShapes(HeapShapeArray);
	
	// ok, compiler infers template type of allocator as TInlineAllocator<16>
	const int32 B = GetNumShapes(InlineShapeArray);
	
	// ok, explicit with template type, but not really necessary
	// or very reusable with hard-coded "16"
	const int32 C = GetNumShapes<TInlineAllocator<16>>(InlineShapeArray);     
}

That's it! With a simple addition, we made more flexible and work with TArrays of any allocator type.

One final word of caution: beware of the code bloat that can come when adding template arguments to large functions, since the compiler generates a version of the code for each template type that is used. Another option is to simply use a consistent typedef within your code, which still allows only a single type but makes it more easily changed.