One of Metal's most important capabilities is compute kernels. Alternatives include OpenCL, CUDA, OpenGL compute shaders, and Vulkan compute pipelines. Metal's key advantage is its deep integration with the rest of the framework: you can achieve exceptional performance by combining compute, rendering, and blitting operations. And this isn't limited to computer graphics tasks; it applies to any computation that can be parallelized in a SIMD (Single Instruction, Multiple Data) manner. In this episode, I'll explain the basic principles of developing compute kernels.
To understand computing on Metal, we first need to examine how compute kernels execute on the GPU.
A dispatched task is divided into threadgroups, each of which contains multiple threads. Every thread runs your compute function. The maximum SIMD group size, the maximum threadgroup size, and the total number of threads all depend on the hardware.
Within the same threadgroup, threads share fast threadgroup memory and execute in a logically concurrent manner, which can be leveraged for more efficient computation; they can be synchronized using barrier mechanisms (threadgroup_barrier()).
This means that, logically, all threads run almost simultaneously, but physically, the GPU processes them in batches called SIMD groups. You can determine the size of a SIMD group using threadExecutionWidth; aligning your threadgroup size to this value enhances performance.
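For instance, these limits can be queried at runtime. The following sketch assumes you already have a device and a compute pipeline state (we'll create one later in this episode):
import Metal

let device = MTLCreateSystemDefaultDevice()! // assumes a Metal-capable device

// Hardware-wide ceiling on threadgroup dimensions for this device.
print(device.maxThreadsPerThreadgroup) // e.g. MTLSize(width: 1024, height: 1024, depth: 1024)

// Per-kernel limits exposed by the compute pipeline state (created later in this episode).
print(pipeline.threadExecutionWidth)          // SIMD group size, e.g. 32
print(pipeline.maxTotalThreadsPerThreadgroup) // e.g. 1024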

More details can be found in the official Metal documentation.
For a better understanding, let's create a simple example kernel that fades a pixel's alpha value and applies a gradient map based on the value stored in the alpha channel:
kernel void krnApplyEffects( // (1)
    texture2d<float, access::read> in [[ texture(0) ]], // (2)
    texture2d<float, access::write> out [[ texture(1) ]], // (3)
    constant float4 *gradient [[ buffer(0) ]], // (4)
    constant float &amount [[ buffer(1) ]], // (5)
    uint2 gid [[ thread_position_in_grid ]]) // (6)
{
    int2 size(in.get_width(), in.get_height()); // (7)
    if (any(int2(gid) >= size)) {
        return;
    }
    float4 res = in.read(gid); // (8)
    // (9) fade the alpha; clamp at 0 so (10) never indexes with a negative value
    res.a = max(min(res.a * amount, res.a - 1.0 / 255.0), 0.0);
    res.rgb = gradient[int(res.a * 255)].rgb; // (10)
    out.write(res, gid); // (11)
}
The input texture is declared for reading (access::read), and the output for writing (access::write). If you need filtered access to a texture, declare it with access::sample and a defined sampler.
NOTE:
- Some devices support access::read_write. Check the feature set tables or use MTLDevice.readWriteTextureSupport (a capability check is sketched below).
- Certain devices allow non-uniform threadgroup sizes and can handle them automatically. In such cases, you can skip the check in step (7).
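Here's a minimal sketch of that capability check on the CPU side; the tiers come from the MTLReadWriteTextureTier enum:
switch device.readWriteTextureSupport {
case .tier1, .tier2:
    // Kernels may declare texture2d<float, access::read_write>
    // (tier 1 limits the supported pixel formats, tier 2 extends the list).
    break
default:
    // .tierNone: fall back to separate read and write textures, as in the example above.
    break
}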
If you need to access a buffer, simply use [] for both reading and writing, as demonstrated with gradient in this example.
I've already covered the kernel side briefly in the episode about the Metal Shading Language, and the overall architecture in the episode about Metal Architecture. So, let's focus on the compute-specific details. As we know, commands for the GPU are grouped by type into encoders:
if let encoder = commandBuffer.makeComputeCommandEncoder() {
    // Perform your compute operations here
    encoder.endEncoding()
}
While for rendering we can run an empty render encoder that still performs some actions (like clearing the attachments), an empty compute encoder makes no sense. Instead, we must use a kernel wrapped in a compute pipeline state. Fortunately, creating a compute pipeline state is simpler than setting up a render pipeline state, at least on the CPU side. Let's encapsulate this process into a function:
func buildComputePipeline(device: MTLDevice, kernelName: String) -> MTLComputePipelineState? {
    guard let library = device.makeDefaultLibrary(), // (1)
          let kernelFunction = library.makeFunction(name: kernelName) else { // (2)
        return nil // avoids force-unwrapping a missing library or kernel
    }
    return try? device.makeComputePipelineState(function: kernelFunction) // (3)
}
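A typical call site for this helper might look like this, using the kernel name from the example above:
guard let pipeline = buildComputePipeline(device: device, kernelName: "krnApplyEffects") else {
    fatalError("Unable to create the compute pipeline state")
}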
NOTE: Similar to rendering functions, you can use function constants to adjust the logic of your kernel at this point.
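As a hedged sketch of that: suppose the kernel declared constant bool useGradient [[ function_constant(0) ]] (a hypothetical constant, not part of the example above). You could then specialize the function like this:
let constants = MTLFunctionConstantValues()
var useGradient = true // hypothetical constant controlling the gradient-map step
constants.setConstantValue(&useGradient, type: .bool, index: 0)

// The throwing variant of makeFunction bakes the constants in at compile time.
let specializedFunction = try? library.makeFunction(
    name: "krnApplyEffects",
    constantValues: constants)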
Once we have a compute pipeline state, we can use it within the compute encoder:
if let encoder = commandBuffer.makeComputeCommandEncoder() {
    // ⬇ NEW CODE ⬇
    encoder.setComputePipelineState(pipeline)
    // ⬆ NEW CODE ⬆
    encoder.endEncoding()
}
To make compute kernels useful, we need to bind the inputs and at least one texture or buffer for the output results (assuming we already have all the required resources):
if let encoder = commandBuffer.makeComputeCommandEncoder() {
    encoder.setComputePipelineState(pipeline)
    // ⬇ NEW CODE ⬇
    encoder.setTexture(image, index: 0) // (1)
    encoder.setTexture(image, index: 1)
    encoder.setBuffer(gradient, offset: 0, index: 0) // (2)
    encoder.setBytes(&amount, length: MemoryLayout<Float32>.size, index: 1) // (3)
    // ⬆ NEW CODE ⬆
    encoder.endEncoding()
}
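The gradient buffer bound in (2) is one of the resources we assumed already exist. Here's a minimal sketch of creating it, where makeGradientColors() is a hypothetical helper returning 256 colors (one per 8-bit alpha value the kernel can look up):
let colors: [SIMD4<Float>] = makeGradientColors() // hypothetical helper, 256 entries
let gradient = device.makeBuffer(
    bytes: colors,
    length: MemoryLayout<SIMD4<Float>>.stride * colors.count,
    options: .storageModeShared) // matches the constant float4 *gradient parameter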
Using setBytes to set the fading amount (3) is a convenient way to pass a small parameter directly into the compute kernel without creating a buffer.
All this doesn't make sense until we dispatch the kernel to the GPU for execution.
First, let's encapsulate the operation into a function; then we'll explore the details.
func dispatch( // (1)
    encoder: MTLComputeCommandEncoder,
    size: MTLSize,
    pipeline: MTLComputePipelineState
) {
    let width = pipeline.threadExecutionWidth // (2)
    let height = pipeline.maxTotalThreadsPerThreadgroup / width // (3)
    let threadgroupSize = MTLSizeMake(width, height, 1) // (4)
    let threadgroupCount = MTLSizeMake( // (5)
        (size.width + threadgroupSize.width - 1) / threadgroupSize.width,
        (size.height + threadgroupSize.height - 1) / threadgroupSize.height,
        1)
    encoder.dispatchThreadgroups( // (6)
        threadgroupCount,
        threadsPerThreadgroup: threadgroupSize)
}
The function (1) takes encoder because it dispatches the kernel, size to calculate how many threads are required to process the image, and pipeline to determine the maximum allowable threadgroup size.
NOTE: If your device supports non-uniform threadgroup sizes, you can use dispatchThreads instead. In that case, calculating threadgroupCount is unnecessary.
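In that path, steps (5) and (6) collapse into a single call; here's a sketch, assuming the device support is confirmed:
// Metal trims the edge threadgroups automatically, so we pass
// the grid size in threads rather than in threadgroups.
encoder.dispatchThreads(
    size,
    threadsPerThreadgroup: threadgroupSize)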
And finally, in the encoder:
if let encoder = commandBuffer.makeComputeCommandEncoder() {
    encoder.setComputePipelineState(pipeline)
    encoder.setTexture(image, index: 0)
    encoder.setTexture(image, index: 1)
    encoder.setBuffer(gradient, offset: 0, index: 0)
    encoder.setBytes(&amount, length: MemoryLayout<Float32>.size, index: 1)
    // ⬇ NEW CODE ⬇
    dispatch(
        encoder: encoder,
        size: MTLSize(width: image.width, height: image.height, depth: 1),
        pipeline: pipeline)
    // ⬆ NEW CODE ⬆
    encoder.endEncoding()
}
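Nothing runs until the command buffer is committed. Assuming commandBuffer came from a command queue, as in the previous episodes, the last step is:
commandBuffer.commit()
// Optional: block until the kernel finishes, e.g. before reading results on the CPU.
commandBuffer.waitUntilCompleted()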