One of Metal's most important capabilities is compute kernels. Alternatives include OpenCL, CUDA, OpenGL compute shaders, and Vulkan compute pipelines. Metal's key advantage is its deep integration with the rest of the framework: you can achieve exceptional performance by combining compute, rendering, and blitting operations. And this isn't limited to computer graphics tasks; it applies to any computation that can be parallelized in a SIMD (Single Instruction, Multiple Data) manner. In this episode, I'll explain the basic principles of developing compute kernels.
To understand computing on Metal, we first need to examine how compute kernels execute on the GPU.
A dispatched task is divided into threadgroups, each of which contains multiple threads. Every thread runs your compute function. The maximum SIMD group size, the maximum threadgroup size, and the total number of threads all depend on the hardware.
Within the same threadgroup, threads share fast threadgroup memory and execute in a logically concurrent manner, which can be leveraged for more efficient computation; they can be synchronized using barrier mechanisms (threadgroup_barrier()).
This means that, logically, all threads run almost simultaneously, but physically, the GPU processes them in batches called SIMD groups. You can determine the size of a SIMD group using threadExecutionWidth; aligning your threadgroup size to this value enhances performance.
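For instance, these limits can be queried at runtime. The following sketch assumes you already have a device and a compute pipeline state (we'll create one later in this episode):
import Metal

let device = MTLCreateSystemDefaultDevice()! // assumes a Metal-capable device

// Hardware-wide ceiling on threadgroup dimensions for this device.
print(device.maxThreadsPerThreadgroup) // e.g. MTLSize(width: 1024, height: 1024, depth: 1024)

// Per-kernel limits exposed by the compute pipeline state (created later in this episode).
print(pipeline.threadExecutionWidth)          // SIMD group size, e.g. 32
print(pipeline.maxTotalThreadsPerThreadgroup) // e.g. 1024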

More details can be found in the official Metal documentation.
For a better understanding, let's create a simple example kernel that fades a pixel's alpha value and applies a gradient map based on the value stored in the alpha channel:
kernel void krnApplyEffects( // (1)
    texture2d<float, access::read> in [[ texture(0) ]], // (2)
    texture2d<float, access::write> out [[ texture(1) ]], // (3)
    constant float4 *gradient [[ buffer(0) ]], // (4)
    constant float &amount [[ buffer(1) ]], // (5)
    uint2 gid [[ thread_position_in_grid ]]) // (6)
{
    int2 size(in.get_width(), in.get_height()); // (7)
    if (any(int2(gid) >= size)) {
        return;
    }
    float4 res = in.read(gid); // (8)
    // (9) fade the alpha; clamp at 0 so (10) never indexes with a negative value
    res.a = max(min(res.a * amount, res.a - 1.0 / 255.0), 0.0);
    res.rgb = gradient[int(res.a * 255)].rgb; // (10)
    out.write(res, gid); // (11)
}
The input texture is declared for reading (access::read), and the output for writing (access::write). If you need filtered access to a texture, declare it with access::sample and a defined sampler.
NOTE:
- Some devices support access::read_write. Check the feature set tables or use MTLDevice.readWriteTextureSupport (a capability check is sketched below).
- Certain devices allow non-uniform threadgroup sizes and can handle them automatically. In such cases, you can skip the check in step (7).
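Here's a minimal sketch of that capability check on the CPU side; the tiers come from the MTLReadWriteTextureTier enum:
switch device.readWriteTextureSupport {
case .tier1, .tier2:
    // Kernels may declare texture2d<float, access::read_write>
    // (tier 1 limits the supported pixel formats, tier 2 extends the list).
    break
default:
    // .tierNone: fall back to separate read and write textures, as in the example above.
    break
}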
If you need to access a buffer, simply use [] for both reading and writing, as demonstrated with gradient in this example.
I've already covered the kernel side briefly in the episode about the Metal Shading Language, and the overall architecture in the episode about Metal Architecture. So, let's focus on the compute-specific details. As we know, commands for the GPU are grouped by type into encoders:
if let encoder = commandBuffer.makeComputeCommandEncoder() {
    // Perform your compute operations here
    encoder.endEncoding()
}
While for rendering we can run an empty render encoder that still performs some actions (like clearing the attachments), an empty compute encoder makes no sense. Instead, we must use a kernel wrapped in a compute pipeline state. Fortunately, creating a compute pipeline state is simpler than setting up a render pipeline state, at least on the CPU side. Let's encapsulate this process into a function:
func buildComputePipeline(device: MTLDevice, kernelName: String) -> MTLComputePipelineState? {
    guard let library = device.makeDefaultLibrary(), // (1)
          let kernelFunction = library.makeFunction(name: kernelName) else { // (2)
        return nil // avoids force-unwrapping a missing library or kernel
    }
    return try? device.makeComputePipelineState(function: kernelFunction) // (3)
}
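A typical call site for this helper might look like this, using the kernel name from the example above:
guard let pipeline = buildComputePipeline(device: device, kernelName: "krnApplyEffects") else {
    fatalError("Unable to create the compute pipeline state")
}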
NOTE: Similar to rendering functions, you can use function constants to adjust the logic of your kernel at this point.
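As a hedged sketch of that: suppose the kernel declared constant bool useGradient [[ function_constant(0) ]] (a hypothetical constant, not part of the example above). You could then specialize the function like this:
let constants = MTLFunctionConstantValues()
var useGradient = true // hypothetical constant controlling the gradient-map step
constants.setConstantValue(&useGradient, type: .bool, index: 0)

// The throwing variant of makeFunction bakes the constants in at compile time.
let specializedFunction = try? library.makeFunction(
    name: "krnApplyEffects",
    constantValues: constants)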
Once we have a compute pipeline state, we can use it within the compute encoder:
if let encoder = commandBuffer.makeComputeCommandEncoder() {
    // ⬇ NEW CODE ⬇
    encoder.setComputePipelineState(pipeline)
    // ⬆ NEW CODE ⬆
    encoder.endEncoding()
}
To make compute kernels useful, we need to bind the inputs and at least one texture or buffer for the output results (assuming we already have all the required resources):
if let encoder = commandBuffer.makeComputeCommandEncoder() {
    encoder.setComputePipelineState(pipeline)
    // ⬇ NEW CODE ⬇
    encoder.setTexture(image, index: 0) // (1)
    encoder.setTexture(image, index: 1)
    encoder.setBuffer(gradient, offset: 0, index: 0) // (2)
    encoder.setBytes(&amount, length: MemoryLayout<Float32>.size, index: 1) // (3)
    // ⬆ NEW CODE ⬆
    encoder.endEncoding()
}
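The gradient buffer bound in (2) is one of the resources we assumed already exist. Here's a minimal sketch of creating it, where makeGradientColors() is a hypothetical helper returning 256 colors (one per 8-bit alpha value the kernel can look up):
let colors: [SIMD4<Float>] = makeGradientColors() // hypothetical helper, 256 entries
let gradient = device.makeBuffer(
    bytes: colors,
    length: MemoryLayout<SIMD4<Float>>.stride * colors.count,
    options: .storageModeShared) // matches the constant float4 *gradient parameter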
Using setBytes to set the fading amount (3) is a convenient way to pass a small parameter directly into the compute kernel without creating a buffer.
All this doesn't make sense until we dispatch the kernel to the GPU for execution.
First, let's encapsulate the operation into a function; then we'll explore the details.
func dispatch( // (1)
    encoder: MTLComputeCommandEncoder,
    size: MTLSize,
    pipeline: MTLComputePipelineState
) {
    let width = pipeline.threadExecutionWidth // (2)
    let height = pipeline.maxTotalThreadsPerThreadgroup / width // (3)
    let threadgroupSize = MTLSizeMake(width, height, 1) // (4)
    let threadgroupCount = MTLSizeMake( // (5)
        (size.width + threadgroupSize.width - 1) / threadgroupSize.width,
        (size.height + threadgroupSize.height - 1) / threadgroupSize.height,
        1)
    encoder.dispatchThreadgroups( // (6)
        threadgroupCount,
        threadsPerThreadgroup: threadgroupSize)
}
The function (1) takes encoder because it dispatches the kernel, size to calculate how many threads are required to process the image, and pipeline to determine the maximum allowable threadgroup size.
NOTE: If your device supports non-uniform threadgroup sizes, you can use dispatchThreads instead. In that case, calculating threadgroupCount is unnecessary.
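In that path, steps (5) and (6) collapse into a single call; here's a sketch, assuming the device support is confirmed:
// Metal trims the edge threadgroups automatically, so we pass
// the grid size in threads rather than in threadgroups.
encoder.dispatchThreads(
    size,
    threadsPerThreadgroup: threadgroupSize)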
And finally, in the encoder:
if let encoder = commandBuffer.makeComputeCommandEncoder() {
    encoder.setComputePipelineState(pipeline)
    encoder.setTexture(image, index: 0)
    encoder.setTexture(image, index: 1)
    encoder.setBuffer(gradient, offset: 0, index: 0)
    encoder.setBytes(&amount, length: MemoryLayout<Float32>.size, index: 1)
    // ⬇ NEW CODE ⬇
    dispatch(
        encoder: encoder,
        size: MTLSize(width: image.width, height: image.height, depth: 1),
        pipeline: pipeline)
    // ⬆ NEW CODE ⬆
    encoder.endEncoding()
}
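Nothing runs until the command buffer is committed. Assuming commandBuffer came from a command queue, as in the previous episodes, the last step is:
commandBuffer.commit()
// Optional: block until the kernel finishes, e.g. before reading results on the CPU.
commandBuffer.waitUntilCompleted()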