
EZMissile System Part 1: Actor Cache
In my new project, I have a specific requirement to fill the whole stage with projectiles, missiles and bombs. For this, a cache system is needed to manage the number of actors on screen.
If you doubt whether this is really necessary, check the code for UWorld::SpawnActor to see how much validation that function performs. Additionally, there are many posts about reducing the number of actors ticking every frame (although we won't be talking about ticking in this post).
In addition to the problems we need to solve, the scope of this plugin requires frequent addition and removal of objects to/from the screen. Moreover, the actors require components like particle systems and collision detection, so using instanced static mesh components doesn't seem like a good idea.
Here we'll talk about the iterations during the creation of the cache system and how it evolved from a simple implementation (that already helped a lot) to an extremely high-performance library. It has to handle hundreds (if not thousands) of actors, so performance is critical.
First iteration: using TMultiMap
I started with a very simple idea of a class containing a TMultiMap holding a list of Actors by their class types.
#pragma once
#include "CoreMinimal.h"
#include "Containers/Map.h"
#include "Containers/Set.h"
#include "Templates/SubclassOf.h"
#include "Algo/Find.h"
#include "Kismet/KismetSystemLibrary.h"
template <typename InElementType>
class TActorPool
{
public:
//static_assert(TIsDerivedFrom<InElementType, AActor>::IsDerived, "ElementType must be derived from AActor");
using KeyType = TSubclassOf<InElementType>;
using MapType = TMultiMap<KeyType, InElementType*>;
using ArrayType = TArray<InElementType*>;
using SetType = TSet<InElementType*>;
using StatsType = TMap<KeyType, int32>;
TActorPool() = default;
TActorPool(TActorPool&&) = delete;
TActorPool(const TActorPool&) = delete;
~TActorPool() = default;
// Extract key from item
FORCEINLINE KeyType ParseKeyFromItem(InElementType* Item) { return Item->GetClass(); }
void Add(InElementType* NewObject)
{
KeyType Key = ParseKeyFromItem(NewObject);
PerClassMap.AddUnique(Key, NewObject);
}
void Remove(InElementType* NewObject)
{
KeyType Key = ParseKeyFromItem(NewObject);
PerClassMap.RemoveSingle(Key, NewObject);
}
int32 Num() const
{
return PerClassMap.Num();
}
int32 Num(KeyType Key) const
{
return PerClassMap.Num(Key);
}
InElementType* FindByPredicate(TSubclassOf<InElementType> Type, TFunctionRef<bool(const InElementType*)> Predicate) const
{
    TArray<InElementType*> FoundObjects;
    PerClassMap.MultiFind(Type, FoundObjects);
    InElementType** FoundObject = Algo::FindByPredicate(FoundObjects, [&Predicate](const InElementType* Item)
    {
        // Check if the item is valid, for safety
        if (!UKismetSystemLibrary::IsValid(Item))
        {
            // Problem with this instance; skip it (CheckPool removes invalid items)
            UE_LOG(LogTemp, Display, TEXT("TActorPool::FindByPredicate (line %d) -> %s is invalid"), __LINE__, *UKismetSystemLibrary::GetDisplayName(Item));
            return false;
        }
        return Predicate(Item);
    });
    return FoundObject ? *FoundObject : nullptr;
}
/* Updates and returns information about the cache size for each subtype */
StatsType& QueryPoolStats()
{
TSet<KeyType> Keys;
PerClassMap.GetKeys(Keys);
for (KeyType Key : Keys)
{
InternalStats.Add(Key, PerClassMap.Num(Key));
}
return InternalStats;
}
/* Iterates over the whole pool to remove invalid items */
void CheckPool(ArrayType& InvalidItems)
{
    for (auto It = PerClassMap.CreateIterator(); It; ++It)
    {
        if (!UKismetSystemLibrary::IsValid(It.Value()))
        {
            // Collect the invalid item, then drop it from the map
            InvalidItems.Add(It.Value());
            It.RemoveCurrent();
        }
    }
}
private:
MapType PerClassMap;
StatsType InternalStats;
};
This first version worked very well by reducing the number of spawn calls. But there is a problem hidden in the FindByPredicate function. Suppose the cache has 200 items, with 160 missiles in use on screen, and a character needs to find a free-to-use missile. Even with the best algorithm, there's a chance that a good part of the entire collection will be traversed, calling the predicate on each item. And we still have the problem that calls to FindByPredicate are not thread-safe, so a race can occur if multiple characters need to find a free missile at the same time.
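To make the cost concrete, here is a hypothetical call site (AMissile, MissilePool and the IsHidden check are illustrative names, not part of the pool above); every request scans the candidates of that class and runs the predicate on each one:
// Hypothetical caller: linearly scans all cached AMissile instances
AMissile* FreeMissile = MissilePool.FindByPredicate(AMissile::StaticClass(),
    [](const AMissile* Missile)
    {
        // "Free to use" can be whatever flag the game tracks,
        // e.g. missiles hidden after expiring
        return Missile->IsHidden();
    });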
Second iteration: using TQueue
To solve both problems, a queue suits very well. An item is inserted at one end of the queue and removed at the other to be used in the scene. We'll use TMpscQueue (multi-producer, single-consumer): it is safe for concurrent item addition from multiple threads, but NOT thread-safe for item removal, so only a single consumer may dequeue.
Here’s the code for this version:
#pragma once
#include "CoreMinimal.h"
#include "Containers/Map.h"
#include "Containers/Set.h"
#include "Containers/MpscQueue.h"
#include "Templates/SubclassOf.h"
#include "Algo/Find.h"
#include "Kismet/KismetSystemLibrary.h"
template <typename InElementType>
class TActorPoolQueue
{
public:
//static_assert(TIsDerivedFrom<InElementType, AActor>::IsDerived, "ElementType must be derived from AActor");
using KeyType = TSubclassOf<InElementType>;
using QueueType = TMpscQueue<InElementType*>;
using MapType = TMultiMap<KeyType, QueueType*>;
using StatsType = TMap<KeyType, int32>;
TActorPoolQueue() = default;
TActorPoolQueue(TActorPoolQueue&&) = delete;
TActorPoolQueue(const TActorPoolQueue&) = delete;
~TActorPoolQueue()
{
    // Free the heap-allocated queues created by FindPerKeyContainer
    for (auto& Pair : PerClassMap)
    {
        delete Pair.Value;
    }
}
/**
* Add or return an element to the appropriate queue
*/
void Add(InElementType* NewObject)
{
KeyType Key = ParseKeyFromItem(NewObject);
QueueType* Queue = FindPerKeyContainer(Key);
Queue->Enqueue(NewObject);
}
/**
 * Remove an element from the queue (returns false if none is available)
 */
bool Remove(KeyType ObjectType, InElementType*& NewObject)
{
    NewObject = nullptr;
    QueueType* Queue = FindPerKeyContainer(ObjectType);
    return Queue->Dequeue(NewObject);
}
// TODO refactor to count all items from queues
int32 Num() const
{
return PerClassMap.Num();
}
// TODO refactor to count all items from queues
int32 Num(KeyType Key) const
{
return PerClassMap.Num(Key);
}
/**
 * Not supported by the queue-based pool; kept only for interface parity
 */
InElementType* FindByPredicate(TSubclassOf<InElementType> Type, TFunctionRef<bool(const InElementType*)> Predicate) const
{
    return nullptr;
}
/**
 * Updates and returns information about the cache size for each subtype
 * (TODO: this counts queues per key, not the items inside them)
 */
StatsType& QueryPoolStats()
{
TSet<KeyType> Keys;
PerClassMap.GetKeys(Keys);
for (KeyType Key : Keys)
{
InternalStats.Add(Key, PerClassMap.Num(Key));
}
return InternalStats;
}
private:
MapType PerClassMap;
StatsType InternalStats;
// Extract key from item
FORCEINLINE KeyType ParseKeyFromItem(InElementType* Item) { return Item->GetClass(); }
// Find the queue for a key type, creating it lazily on first use
FORCEINLINE QueueType* FindPerKeyContainer(KeyType Key)
{
    QueueType** Queue = PerClassMap.Find(Key);
    if (Queue == nullptr)
    {
        // TMultiMap::Add returns a reference to the stored value,
        // so we avoid a second lookup
        return PerClassMap.Add(Key, new QueueType());
    }
    return *Queue;
}
};
This implementation simplified the management of the instances and is very fast. In our case, expired missiles are enqueued at the tail of the queue, and requesting players dequeue missiles from the head.
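A quick sketch of how this looks at the call site (AMissile, MissilePool, MissileClass and the spawn fallback are illustrative, not part of the plugin):
// When a missile expires, return it to the pool...
MissilePool.Add(ExpiredMissile);

// ...and when a character fires, try to reuse an instance before spawning a new actor.
AMissile* Missile = nullptr;
if (!MissilePool.Remove(MissileClass, Missile))
{
    Missile = GetWorld()->SpawnActor<AMissile>(MissileClass);
}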
This version is enough for our purposes. But I wanted to try something more advanced, and at the same time prevent some possible problems.
The first issue concerns the missile handling. While in use, the missile is outside the cache, which has no knowledge of its status. This is not how a cache should behave.
The second issue is that, even though it performs better than searching for a free object in a TArray, TQueue allocates and deallocates memory on every operation because it is internally a linked list. Although each node object is very small, the allocations happen on the heap, and the behavior of this implementation can cause memory fragmentation and delays during insertion and removal. And yes, this is very low level, but obtaining better performance requires saving every cycle possible.
Field Research
At this point, I inspected many actor pools developed for Unreal Engine, both open-source and paid plugins. The vast majority use a TMap whose key is the class type (a TSubclassOf object) and whose value is a tuple containing two arrays: one for the active (on-screen) objects and another for the inactive ones.
They all work very well, but I had doubts about their performance under our high-frequency addition and removal requirements. At the end of the day, moving items between arrays means moving memory from one place to another, and that has a cost.
I thought of a way to avoid moving any memory and instead just track the state of each object instance. Since each object has only two states (active and inactive), one bit per instance is enough. So we can use a combination of a TArray containing the actor instances and a bitmap storing the status of each instance, both controlled by a storage container per class type. The containers are mapped by a TMap, because TMultiMap doesn't support accessing list values by index, only retrieving the whole list.
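In code, the layout would be roughly the following (a minimal sketch; TPoolContainer and the field names are illustrative, not the final plugin API):
template <typename InElementType>
struct TPoolContainer
{
    // Actor instances owned by this container, all of the same class type
    TArray<InElementType*> InstanceArray;
    // Bitmap index: one bit per instance (1 = in use, 0 = free)
    TArray<int64> UsageArray;
};

// Containers mapped by class type; a TMap (unlike TMultiMap) lets us fetch
// a container directly by key:
// TMap<TSubclassOf<InElementType>, TPoolContainer<InElementType>> PerClassMap;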
Third iteration: creating the storage container
The difficult part here is the logic to scan the instance list and detect the first free instance. We'll use the same feature databases use and create what is called a bitmap index. Bitmap indexes are not used for sorting, but to assign a status that maps to another list. This way we can easily query the indexes of some or all of the instances that have the status we desire.
Search for unused instances
For our bitmap index we'll use a TArray<int64>, where each bit of each element represents the status of an item in the instances array. To find the first free instance in the index, the algorithm is the following:
// Bitmap Index array
TArray<int64> UsageArray;
static constexpr int32 IndicesPerInt64 = 64;
...
...
int32 FindFreeIndex() const
{
    // For each item in the index list...
    for (int32 i = 0; i < UsageArray.Num(); ++i)
    {
        // This is a GUARD: if the whole item is 1 (i.e. no available instance)...
        if (UsageArray[i] == ~0LL)
        {
            // ...then continue to the next item
            continue;
        }
        // For each bit in the item...
        for (int32 j = 0; j < IndicesPerInt64; ++j)
        {
            // If the bit has value 0...
            if ((UsageArray[i] & (1LL << j)) == 0)
            {
                // Return bit index combined with item index
                return i * IndicesPerInt64 + j;
            }
        }
    }
    return INDEX_NONE;
}
As a side note, I really like guard clauses. Exiting early when a condition is not met greatly simplifies the code flow (though I'm not sure how performant this is from the CPU's point of view).
From a performance point of view, the algorithm above has room for improvement. First, there are nested loops, which scale with the number of items in the array. Second, the inner loop compares bit by bit, which doesn't seem like a problem, but there are CPU instructions built exactly for this. The FMath::CountTrailingZeros64 function fits this flow perfectly: it counts the number of unset bits (= 0), starting from the least significant bit, until it finds the first set bit.
In our case, FindFreeIndex searches for the first unset bit. So we need to invert the bitmap value to use the FMath::CountTrailingZeros64 function correctly.
// Bitmap Index array
TArray<int64> UsageArray;
static constexpr int32 IndicesPerInt64 = 64;
...
...
int32 FindFreeIndex() const
{
    // For each item in the index list...
    for (int32 i = 0; i < UsageArray.Num(); ++i)
    {
        // Count trailing zeros on the inverted value: the result is the
        // position of the first free (0) bit, or 64 if every bit is set
        int32 FreeBitIndex = FMath::CountTrailingZeros64(~UsageArray[i]);
        if (FreeBitIndex < 64)
        {
            // Return bit index combined with item index
            return i * IndicesPerInt64 + FreeBitIndex;
        }
    }
    return INDEX_NONE;
}
An unexpected optimization is that we can remove the guard clause: checking whether the result of FMath::CountTrailingZeros64 is less than 64 is equivalent to checking UsageArray[i] != ~0LL, because when every bit is set the negated input is zero and the function returns 64.
Updating indexes
After selecting the available index in our bitmap index structure, we need to update the entry to mark it as in use. And when the item is returned to the cache, we mark it as free again.
void MarkAsUsed(int32 Index)
{
int32 UsageArrayIndex = Index / IndicesPerInt64;
int32 BitIndex = Index % IndicesPerInt64;
UsageArray[UsageArrayIndex] |= (1LL << BitIndex);
}
void MarkAsFree(int32 Index)
{
int32 UsageArrayIndex = Index / IndicesPerInt64;
int32 BitIndex = Index % IndicesPerInt64;
UsageArray[UsageArrayIndex] &= ~(1LL << BitIndex);
}
There's no secret here. Basically, the task is to recover the item and bit indexes from the combined index returned by FindFreeIndex, and use some bitwise operations to alter the bit value.
To mark a bit as used we do an OR operation between the index item and the number 1 shifted left by the bit index. As an example, to set bit 18 (counting from zero):
Index item 0100101010010010101001001010100100101010010010101001001010100110
Shift value 0000000000000000000000000000000000000000000001000000000000000000 OR
--------------------------------------------------------------------------------
Index updated 0100101010010010101001001010100100101010010011101001001010100110
Unmarking the bit is a little trickier, as we need to do an AND operation between the index item and the NEGATED value of the number 1 shifted by the bit index. As an example, to clear the same bit 18:
Shift value 0000000000000000000000000000000000000000000001000000000000000000 NOT
--------------------------------------------------------------------------------
Shift updated 1111111111111111111111111111111111111111111110111111111111111111
Index item 0100101010010010101001001010100100101010010011101001001010100110
Shift updated 1111111111111111111111111111111111111111111110111111111111111111 AND
--------------------------------------------------------------------------------
Index updated 0100101010010010101001001010100100101010010010101001001010100110
Since the only bit in the mask that differs from 1 is the one we want to unmark, only that bit changes its value; every other bit remains unchanged.
By closely observing the calculation used to recover the item and bit indexes, we know that IndicesPerInt64 is and always will be a power of two, so we can simply replace the division (slow) and modulus (slower) operators with bitwise operations (very, very fast). And as the code to parse the indexes is the same in both functions, we can move it to a helper.
FORCEINLINE void ParseInternalIndexes(int32 Index, int32& ItemIndex, int32& BitIndex)
{
    ItemIndex = Index >> FMath::CeilLogTwo(IndicesPerInt64); // Optimized division
    BitIndex = Index & (IndicesPerInt64 - 1); // Optimized modulo
}
FORCEINLINE void MarkAsUsed(int32 Index)
{
    int32 UsageArrayIndex = 0, BitIndex = 0;
    ParseInternalIndexes(Index, UsageArrayIndex, BitIndex);
    UsageArray[UsageArrayIndex] |= (1LL << BitIndex);
}
FORCEINLINE void MarkAsFree(int32 Index)
{
    int32 UsageArrayIndex = 0, BitIndex = 0;
    ParseInternalIndexes(Index, UsageArrayIndex, BitIndex);
    UsageArray[UsageArrayIndex] &= ~(1LL << BitIndex);
}
With these methods in place, we can go deeper by checking the flow between FindFreeIndex and the function that fetches a free instance from the cache:
AActor* GetFreeInstance()
{
    int32 InstanceIndex = FindFreeIndex();
    // Guard against an exhausted pool: FindFreeIndex may return INDEX_NONE
    if (InstanceIndex == INDEX_NONE)
    {
        return nullptr;
    }
    MarkAsUsed(InstanceIndex);
    return InstanceArray[InstanceIndex];
}
Did you notice that FindFreeIndex already computes the item index and the bit offset, and then MarkAsUsed recovers both by parsing the combined value again? We can obtain both indexes and mark the instance as used in a single pass:
int32 FindFreeIndex()
{
for (int32 It = 0; It < UsageArray.Num(); ++It)
{
// Optimized search through bits
int32 FreeBitIndex = FMath::CountTrailingZeros64(~UsageArray[It]);
if (FreeBitIndex < 64)
{
// Mark item as "in use"
UsageArray[It] |= (1LL << FreeBitIndex);
// Return instance index
return It * IndicesPerInt64 + FreeBitIndex;
}
}
return INDEX_NONE;
}
We need to remove the const modifier from this function, because we are now changing fields of the owning class. We can also remove the MarkAsUsed call from GetFreeInstance.
AActor* GetFreeInstance()
{
    int32 InstanceIndex = FindFreeIndex();
    // Guard against an exhausted pool
    return InstanceIndex != INDEX_NONE ? InstanceArray[InstanceIndex] : nullptr;
}
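For completeness, returning an instance to the cache only needs to flip its bit back to free. A minimal sketch, assuming the instance can be located in the storage array (the post doesn't show this function, and IndexOfByKey is just one possible lookup):
void ReturnInstance(AActor* Instance)
{
    // Locate the instance in the storage array (illustrative lookup)
    int32 InstanceIndex = InstanceArray.IndexOfByKey(Instance);
    if (InstanceIndex != INDEX_NONE)
    {
        MarkAsFree(InstanceIndex);
    }
}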
Final Code and Results
And what about the comparison among all the implementations? Using a unit test that roughly simulates in-game usage by randomly adding and removing instances from the pool, we can estimate the performance. The unit test was run 15 times, each time executing 10,000 random add/remove operations; the average times are in the table below, and a sketch of the test loop follows the table.
| Implementation | Average Test Run Time (in seconds) |
| --- | --- |
| TMultiMap w/ FindByPredicate | 0.0392638 |
| Queue | 0.0028125 |
| Bitmap w/o optimizations | 0.0015458 |
| Bitmap optimized | 0.0013150 |
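The stress loop was along these lines (a simplified sketch of the methodology described above, not the actual unit test; Pool, ActiveActors and AMissile are illustrative, using the queue version's API):
// Illustrative stress loop: random add/remove operations against the pool
const double StartTime = FPlatformTime::Seconds();
for (int32 Op = 0; Op < 10000; ++Op)
{
    if (FMath::RandBool() && ActiveActors.Num() > 0)
    {
        // Return an active actor to the cache
        Pool.Add(ActiveActors.Pop());
    }
    else
    {
        // Request an instance back from the cache
        AMissile* Missile = nullptr;
        if (Pool.Remove(AMissile::StaticClass(), Missile))
        {
            ActiveActors.Push(Missile);
        }
    }
}
UE_LOG(LogTemp, Display, TEXT("Elapsed: %f s"), FPlatformTime::Seconds() - StartTime);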
It is important to realize that none of these execution times is bad in absolute terms, but each implementation is noticeably faster than the previous one. From my point of view, either the queue or the bitmap implementation can be used.
I hope this article is useful. I applied some knowledge from corporate development to improve performance in these low-level tasks. I'll talk about batch processing for missile and projectile movement in the second part of this series.