All posts


  • Text Generation for Unity Jobs

    You have a NativeArray of floats and want to add a number to each element. You make an addition job. Easy!

    [BurstCompile(FloatPrecision.Standard, FloatMode.Fast, CompileSynchronously = true)]
    public struct AdditionJob : IJobParallelFor
    {
        [ReadOnly] public float NumberToAdd;
        public NativeArray<float> Values;

        public void Execute(int index)
        {
            Values[index] += NumberToAdd;
        }
    }

    Then you do a quick test and realise that using float4s in the job instead of floats allows Burst to use SIMD operations. This cuts the time taken to add a number to four million elements to almost a third (from 1.996ms to 0.765ms). So you copy and paste the job to make a float4 version. Later in the project you realise that int and int4 versions would be useful as well. So, calling on the Rule of Three, you decide to generalise somehow.

    You’d like to make jobs for each of the Unity.Mathematics numeric types: double, float, int and uint, with dimensions up to 4. This ends up being a lot of jobs! Unfortunately you can’t use generics because there is no addition constraint. So the only real alternative is to use T4 text generation.

    T4 templates let you write C# code to generate text. You may have noticed that the Unity.Mathematics code is all generated by T4 templates, and Unity.Collections uses T4 text generation to generate types for FixedList and NativeString.

    So to generate jobs for each numeric type and each dimension I can use this template code:

    var TYPES = new []{"double", "float", "int", "uint"};
    foreach (var TYPE in TYPES)
    {
        for (int i = 1; i <= 4; i++)
        {
            for (int j = 1; j <= 4; j++)
            {
                if (i == 1 && j > 1) break;
                string NUM1 = i == 1 ? "" : i.ToString();
                string SEP = j == 1 ? "" : "x";
                string NUM2 = j == 1 ? "" : j.ToString();
                var TYPE_FULL = $"{TYPE}{NUM1}{SEP}{NUM2}";
                // ... emit the job for TYPE_FULL here ...
            }
        }
    }

    and then the job looks like:

    [BurstCompile(FloatPrecision.Standard, FloatMode.Fast, CompileSynchronously = true)]
    public struct AdditionJob_<#=TYPE_FULL#> : IJobParallelFor
    {
        [ReadOnly] public <#=TYPE_FULL#> NumberToAdd;
        public NativeArray<<#=TYPE_FULL#>> Values;

        public void Execute(int index)
        {
            Values[index] += NumberToAdd;
        }
    }

    Running it gives you all the combinations you need.
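    The enumeration the template performs is easy to sanity-check outside of T4. Here is a Python sketch mirroring the same loop (illustrative only, not part of the original template):

```python
# Mirror of the T4 loop: enumerate the Unity.Mathematics numeric type names.
TYPES = ["double", "float", "int", "uint"]

names = []
for t in TYPES:
    for i in range(1, 5):
        for j in range(1, 5):
            if i == 1 and j > 1:
                break  # scalar types have no matrix variants
            num1 = "" if i == 1 else str(i)
            sep = "" if j == 1 else "x"
            num2 = "" if j == 1 else str(j)
            names.append(f"{t}{num1}{sep}{num2}")

print(len(names))  # 13 names per base type (1 scalar + 3 vectors + 9 matrices), 52 in total
```

    Each base type yields the scalar, the 2–4 component vectors, and the 2x2 through 4x4 matrices.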

    Burst Addition Jobs

    You can even use the text templates to generate the tests. This may seem like a waste of test time, but it actually helped me find precision differences between the float and float4 versions of these jobs.

    Burst Addition Job Tests

    Very handy. I’m sure I’m going to be using more T4 templates in the future.


  • Native Sparse Array

    (Part of a series where I think out loud about an out-of-core point cloud renderer I’m working on.)

    A sparse array uses the classic programming trick of adding another layer of indirection: not by using a pointer, but by adding an abstraction over a traditional array.

    I’m in the process of building an Octree for use in a Point Cloud Renderer. The Octree will store nodes layer by layer. This means that some layers will be mostly empty space. Rather than storing the nodes directly, this sparse array has two arrays: one to store the data and the other to store the indices of the data. The indices are kept sorted, so a binary search can be used to find data, or insertion points when adding data.

    A sparse array is usually slower for large data sets, and a hash map is preferred. But for the point cloud renderer, most of the time will be bulk processing the nodes for culling, etc. So having all the nodes in contiguous memory is just what I want.

    I’d like the SparseArray API to be as close as possible to NativeArray’s, but with two caveats:

    • The CopyTo, CopyFrom and ToArray functions don’t make a lot of sense: how should a sparse array be converted to a normal array? You can use the public fields Indices and Data to get access to the underlying arrays if you’d like to copy to and from NativeArrays.
    • Indexing into the array can add data. For example, array[1] = 2 will add index 1 with data 2. The array is fixed size, so simple index access can fail if the array is full. It will fail silently if collection checks are off. Prefer IsFull, ContainsIndex and AddValue to explicitly add items and check for errors.
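    A minimal Python sketch of the structure described above (the method names IsFull, ContainsIndex and AddValue come from the post; the signatures and everything else are assumptions for illustration):

```python
import bisect

class SparseArray:
    """Fixed-capacity sparse array: a sorted index array plus a parallel data array."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.Indices = []  # sorted sparse indices
        self.Data = []     # data, kept in the same order as Indices

    def IsFull(self):
        return len(self.Indices) == self.capacity

    def ContainsIndex(self, index):
        pos = bisect.bisect_left(self.Indices, index)
        return pos < len(self.Indices) and self.Indices[pos] == index

    def AddValue(self, value, index):
        # Explicit add: fails (returns False) when full or the index exists.
        if self.IsFull() or self.ContainsIndex(index):
            return False
        pos = bisect.bisect_left(self.Indices, index)  # binary search for the insertion point
        self.Indices.insert(pos, index)
        self.Data.insert(pos, value)
        return True

    def __getitem__(self, index):
        pos = bisect.bisect_left(self.Indices, index)
        if pos == len(self.Indices) or self.Indices[pos] != index:
            raise KeyError(index)
        return self.Data[pos]
```

    For example, after AddValue(2.0, 10) and AddValue(1.0, 3), Indices holds [3, 10] and a lookup of index 10 binary-searches to position 1.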

    Native Collection Extensions

    The Native Collection API is fairly sparse. So I’ve added the following extension methods:

    • Swap
    • Insert
    • RemoveAt
    • BinarySearch
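    The behaviour of these extensions can be sketched in Python over a plain list standing in for a fixed-size native array (illustrative only; the real versions operate on raw pointers):

```python
def swap(a, i, j):
    a[i], a[j] = a[j], a[i]

def insert(a, index, value):
    # Shift elements up one slot; the last element falls off the end,
    # mimicking an insert into a fixed-size buffer.
    for k in range(len(a) - 1, index, -1):
        a[k] = a[k - 1]
    a[index] = value

def remove_at(a, index, default=0):
    # Shift elements down one slot and clear the last.
    for k in range(index, len(a) - 1):
        a[k] = a[k + 1]
    a[-1] = default

def binary_search(a, value):
    # Returns the index of value, or the bitwise complement of the
    # insertion point when absent (the classic .NET convention).
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == value:
            return mid
        if a[mid] < value:
            lo = mid + 1
        else:
            hi = mid - 1
    return ~lo
```

    Returning the complement of the insertion point lets one search serve both lookups and sorted inserts.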

    Strangely, there is no common interface for the Native Collections, so the common pattern is to make a private function that implements the algorithm using raw pointers, which gets called by type-specific wrappers. This means the assembly must be marked unsafe, and needs a decent number of unit tests and precondition tests. I’ve followed the example of the native collections and used ENABLE_UNITY_COLLECTIONS_CHECKS to enable the precondition checks.

    Job Safety

    NativeSparseArray does not do atomic reads or writes. It can still safely be used within a job as long as each job accesses its own array indices. The underlying indices and data arrays are available as public properties and can be used within jobs.

    Native Container Resources

    If you’d like to write your own native containers then the best resources are:


    The code for the NativeSparseArray and the Native Collection Extensions is available in the GitHub repo.


  • Morton Order - Burst

    (Part of a series where I think out loud about an out-of-core point cloud renderer I’m working on.)

    Morton encoding is a perfect set up for Burst, lots of data in a row to crunch!

    So what happens when I turn it on? In order to use Burst, we need to have our code in a job.

    [BurstCompile(FloatPrecision.Standard, FloatMode.Fast, CompileSynchronously = true)]
    public struct MortonDecodeJob_Packed : IJob
    {
        [ReadOnly] public NativeArray<uint4> Codes;
        public NativeArray<uint4x3> Coordinates;

        public void Execute()
        {
            for (int i = 0; i < Codes.Length; i++)
            {
                Coordinates[i] = Morton.DecodeMorton3(Codes[i]);
            }
        }
    }

    CompileSynchronously is only used in the editor to make sure the Burst code is compiled before this function is called, rather than using the non-Burst code until it’s ready.

    I then fire everything off like this:

    var encodeJob = new MortonEncodeJob()
    {
        Coordinates = m_Coordinates,
        Codes = m_Codes
    };
    var decodeJob = new MortonDecodeJob()
    {
        Codes = m_Codes,
        Coordinates = m_CoordinatesDecoded
    };
    var encodeJobHandle = encodeJob.Schedule();
    var decodeJobHandle = decodeJob.Schedule(encodeJobHandle);

    I’m using the Performance Testing API to measure results. It has some nice features like warm-up and iterations.

    So I can run a test like this:


    So what’s the result? I encoded and decoded one million coordinates. Without Burst it takes 205.39ms; the same code outside of the job takes 207.98ms, so there is some job overhead. With Burst on it takes 6.12ms. This is huge: exactly the same code runs 33x faster.

    For a great background on what Burst does and how it can achieve this speed up, check out - ECS - Deep Dive into the Burst Compiler.

    But we can do better. Inspired by Intrinsics: Low-level engine development with Burst, I learnt that if we use “packed” numbers, Burst can fit more data into registers in a single operation and also auto-vectorise the Unity.Mathematics operations.

    Our encoding function changes from public static uint EncodeMorton3(uint3 coordinate) to public static uint4 EncodeMorton3(uint4x3 coordinates). It could equally be public static uint4 EncodeMorton3(uint4 coordinateX, uint4 coordinateY, uint4 coordinateZ). I just chose the former because I can easily change an array of uint3s to an array of uint4x3s by type punning, like this: m_Coordinates.Reinterpret<uint3x4>(UnsafeUtility.SizeOf<uint3>()), and then transposing each of the elements.
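    The reinterpret-and-transpose step is just an array-of-structs to struct-of-arrays shuffle. A Python sketch of repacking four uint3 coordinates into the uint4x3-style lane layout (names here are illustrative):

```python
def pack4(coords):
    # coords: four (x, y, z) tuples -> three 4-wide lanes (xxxx, yyyy, zzzz),
    # i.e. the transpose that turns a uint3x4 view into a uint4x3.
    assert len(coords) == 4
    xs, ys, zs = zip(*coords)
    return list(xs), list(ys), list(zs)

lanes = pack4([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
print(lanes)  # ([1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12])
```

    Each lane then maps naturally onto one SIMD register, so the scalar bit-twiddling code runs on four coordinates at once.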

    Assembly is not my strong point, but what I’m looking for is the code that moves data into registers changing from:

    movsxd  r8, dword ptr [rcx + 8]


    movdqa  xmmword ptr [rsp + 80], xmm11
    movdqa  xmmword ptr [rsp + 64], xmm10
    movdqa  xmmword ptr [rsp + 48], xmm9
    movdqa  xmmword ptr [rsp + 32], xmm8
    movdqa  xmmword ptr [rsp + 16], xmm7
    movdqa  xmmword ptr [rsp], xmm6

    xmmword is a SIMD data type with 128 bits (16 bytes), which is exactly the size of a uint4 (one column of the uint4x3).

    Here are the first couple of lines of the Morton encoding:

    x = (x ^ (x << 16)) & 0xff0000ff; // x = ---- --98 ---- ---- ---- ---- 7654 3210
    x = (x ^ (x << 8)) & 0x0300f00f;  // x = ---- --98 ---- ---- 7654 ---- ---- 3210

    In the unpacked version we see this:

    shl     ecx, 16
    or      ecx, edx
    shl     eax, 8
    and     eax, 61440
    and     ecx, 50331663
    or      ecx, eax

    The packed version is exactly the same code, but with a uint4 passed in instead of a uint:

    pslld   xmm6, 16
    por     xmm6, xmm0
    pslld   xmm4, 8
    pand    xmm4, xmm1
    pand    xmm6, xmm2
    por     xmm6, xmm4

    Here we get packed versions of the same operations, so we get four operations for the price of one.

    This brings the test time from 6.12ms to 3.06ms, another 2x speed up.

    I thought I’d also try IJobParallelFor (batch size 32) to see if I could get a further speed up by splitting the work across cores. This increased the time taken to 151.61ms. Still less than not using Burst at all, but much worse than just turning Burst on.

    The big question now is the data layout for points: either an array each for x, y and z, or the interleaved uint4x3 I’ve been using in these tests. The interleaved data is more complicated to work with, but for very large data sets I like the idea of not having to maintain indices that fit beyond the bounds of an array.

  • Morton Order - Introduction


    For the out-of-core point cloud renderer I’m working on I need a way to convert a node coordinate into an array index. The idea is all nodes in a layer and children of a node should be in contiguous memory for easy access and cache friendly processing.

    In my first implementation I used Hilbert curves. They guarantee each array index will be a next-door neighbour node, but they are reasonably complex to calculate. Morton order guarantees the 8 children of a node will be contiguous, but there may be discontinuities between those blocks.

    What is Morton Order (Z-Order, Lebesgue Curve)

    To store an Octree node in an array, we need a way to convert its 3D coordinate to a 1D array index.

    Let’s take the simple case of the first level of the Octree, with 8 nodes. Each axis will be in either position 0 or 1, so we can use one bit for each axis, a total of 3 bits, to encode each node. As we enumerate the nodes, these bits increment just like normal binary numbers, and they can be used as an integer index into an array.

    node  | bits | index
    0,0,0 |  000 |    0
    0,0,1 |  001 |    1
    0,1,0 |  010 |    2
    0,1,1 |  011 |    3
    1,0,0 |  100 |    4
    1,0,1 |  101 |    5
    1,1,0 |  110 |    6
    1,1,1 |  111 |    7

    For larger coordinates we continue to have sets of 3 bits for each additional digit on each axis and the Z pattern repeats.
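    The table above can be checked with a few lines of Python, taking the last coordinate component as the least-significant bit:

```python
def level1_index(x, y, z):
    # Interleave one bit per axis: (x, y, z) -> the binary number xyz.
    return (x << 2) | (y << 1) | z

# Enumerate the 8 level-1 nodes in the same order as the table.
nodes = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
indices = [level1_index(*n) for n in nodes]
print(indices)  # [0, 1, 2, 3, 4, 5, 6, 7]
```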

    Morton order is perfect for an Octree, the 8 children of a node are guaranteed to be in contiguous memory.

    This blog post from Asger Hoedt gives a great introduction to Morton order.

    The Octree for the point cloud can have data in non-leaf nodes, so each level of the Octree will have its own array, filled with nodes for that level.

    Encoding and Decoding

    To make a Morton code, we need to take each of the 3 axes and interleave their bits. Asger Hoedt’s post above gives a method which repeatedly shifts a number up and masks out the bits we don’t want. Jeroen Baert gives 3 more methods; the second is similar to Asger’s “Magic Bits” method.

    Fabian “ryg” Giesen also mentions the same method as Asger Hoedt and gives an implementation for a 3D encoding.

    A follow-up blog post shows that the third method, using a LUT, is faster. But Asger/ryg’s method is more readable and, I believe, more open to Burst performance optimisations.

    Here is the code. I still have to add limits checking and Burst friendly versions.
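    As a reference point, the magic-bits method can be sketched in Python (the masks are the standard ones from ryg’s write-up; the bit order is chosen so the last coordinate component lands in the lowest bit, matching the table above):

```python
def part1by2(n):
    # Spread the low 10 bits of n so each bit lands every third position.
    n &= 0x000003FF
    n = (n ^ (n << 16)) & 0xFF0000FF
    n = (n ^ (n << 8)) & 0x0300F00F
    n = (n ^ (n << 4)) & 0x030C30C3
    n = (n ^ (n << 2)) & 0x09249249
    return n

def compact1by2(n):
    # Inverse of part1by2: gather every third bit back together.
    n &= 0x09249249
    n = (n ^ (n >> 2)) & 0x030C30C3
    n = (n ^ (n >> 4)) & 0x0300F00F
    n = (n ^ (n >> 8)) & 0xFF0000FF
    n = (n ^ (n >> 16)) & 0x000003FF
    return n

def encode_morton3(x, y, z):
    return (part1by2(x) << 2) | (part1by2(y) << 1) | part1by2(z)

def decode_morton3(code):
    return compact1by2(code >> 2), compact1by2(code >> 1), compact1by2(code)

print(encode_morton3(1, 1, 1))  # 7
```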


  • Infinite Points - Introduction


    It’s becoming more and more common to use photogrammetry and lidar scanning to capture buildings, engineering projects, sites of cultural significance, archaeological digs, etc. Most of these point clouds are in the billions of points: slow to render and unable to fit into memory.

    There are workflows for converting point cloud data into meshes, but the process is usually laborious and data is lost in the process. For digs or remote inspections keeping all the point cloud data is very important.

    I’d like to build an out-of-core point cloud renderer for Unity that solves these issues by keeping parts of the point cloud data on disk and reading in the most important data for the user’s viewpoint.


    • The data should be stored on disk in multiple files.
    • These files should be able to be loaded into memory at runtime.
    • The data should be laid out efficiently in memory to allow for quick processing, eg for culling.

    Previous Work


    The data will be stored in an Octree. Rather than nodes containing pointers to their children, each layer of the Octree is stored in a sparse array. This means nodes can be stored in a tightly packed array, ready for processing.

    Level 0 of the Octree is a single node, level 1 has 8 nodes, level 2 has 64, etc.


    Nodes are stored in arrays, so we need a way to convert a node with a 3D coordinate into an array index, and back again. Node coordinates do not need to be explicitly stored; they are inferred from the array index.

    The indexing should be contiguous, so children of a node are stored together in the array. This means children can be accessed using a simple index range and processed quickly without any cache misses.
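    One way to see the contiguity property: a child's index appends three bits to its parent's, so the children form a block of eight (a hypothetical helper, not from the post):

```python
def child_indices(parent_index):
    # The 8 children of the node at parent_index on the next level
    # occupy the contiguous block [parent_index * 8, parent_index * 8 + 8).
    first = parent_index * 8
    return list(range(first, first + 8))

print(child_indices(3))  # [24, 25, 26, 27, 28, 29, 30, 31]
```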

    The most straightforward way to do this is Morton order (Z-order, Lebesgue curve), a space-filling curve.

    Sparse Array

    The number of nodes at each level grows very quickly. Assuming that lower levels will be mostly empty, we’ll use a sparse array. This is the classic solution of adding a layer of indirection: the nodes are stored together in one array, and another array of the same size contains the sorted large indices.

    A single array lookup is a binary search of the index array. Efficient processing of contiguous memory is still possible by turning an index range of large indices into an index range of nodes that exist. The most common case would be to process (e.g. culling) an entire level of nodes directly, without using the index indirection.
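    In Python terms, turning a range of large indices into a range of stored nodes is just two binary searches over the sorted index array (a sketch under the assumptions above):

```python
import bisect

def node_range(sorted_indices, lo, hi):
    # Map the large-index range [lo, hi) onto the positions of the
    # nodes that actually exist in the packed data array.
    start = bisect.bisect_left(sorted_indices, lo)
    end = bisect.bisect_left(sorted_indices, hi)
    return start, end

indices = [2, 5, 9, 17, 40]           # sorted large indices
start, end = node_range(indices, 5, 20)
print(indices[start:end])             # [5, 9, 17]
```

    The same (start, end) pair slices the data array directly, so the bulk of the work touches only contiguous memory.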

    Please follow along here:


  • Using Angular.js to view Melbourne's public art

    I’ve been wanting to learn more about AngularJS, so I made a quick web page to view the locations of all of Melbourne’s public art on a Google Map.

    There is an amazing amount of open data on the web, and the Australian government has been releasing its data through open data portals. The dataset of all Melbourne’s public art seemed like a great place to start.

    I used the AngularJS Google Maps directives. The library provides a promise that gets called when Google Maps is loaded and ready to go. Then I fire off an HTTP request for the dataset, with yet another promise that extracts the marker data and sets it on the data-bound markers array.

    I center the map over Melbourne with an appropriate zoom level and off we go.

    myApp.controller( "gMap", function( $scope, $http, uiGmapGoogleMapApi ) {
        uiGmapGoogleMapApi.then(function(maps) {
            $ = {
                center: { latitude: -37.8141, longitude: 144.9633 },
                zoom: 14
            };
            // dataUrl is the dataset endpoint (elided here).
            $http.get( dataUrl ).then(function(response) {
                // The exact path into the payload depends on the dataset format.
                $scope.markers =;
            });
        });
    });

    The second piece is the markers themselves. Using the ng-repeat directive, AngularJS will generate all the markers without me having to write any imperative code.

    <ui-gmap-marker ng-repeat="marker in markers"
            coords="{ 'latitude': marker[19][1], 'longitude': marker[19][2] }" idkey="marker[0]">
    </ui-gmap-marker>

    Now hard coding the column numbers of the data set isn’t the most robust solution, but it got the job done for this blog post!

    And finally, here is the page with the public art on a Google map.

    I was amazed at how easy this was to get up and running, and I hope to write some more AngularJS soon.

  • Using a Fluid Sim to Generate a Flow Map


    Here is a proof of concept of using a 2D fluid simulation to generate a Flow Map. This follows on from my Fluid/Flow Map ShaderToy experiment to do the same thing, except here I have a much better fluid simulation.

    Press the ‘f’ key to swap between the fluid simulation and the flow map. When the flow map is being displayed the fluid simulation stops.

    This could be used in a game to deal with dynamic obstacles in water or smoke, without having to run a full fluid simulation all the time. The fluid simulation could be amortised over many frames, leading to a cheap fluid effect.

    The implementation is in Three.js/WebGL; check the source code of this page for details.

  • A fluid sim with flow map ShaderToy experiment.

    Following on from my previous Flow Map post, I wanted to see what flow maps would look like with maps generated in real time using a fluid sim.