Cooperative Compute Shader


I am looking for an example of how to create a compute shader so that the results of each threadgroup can be analyzed and compared after execution.

I am unsure how to set this up. In my use case I need to iterate over a memory area of 800x600 created by procedural 2D SDFs to find the point with the largest distance into the shape (no textures used). How do I split this up over several threadgroups? I have it running on one thread but it’s way too slow.


You can access shared memory with the threadgroup qualifier. Check out this sample code: 2D array for sum optimization · Issue #1 · mateuszbuda/GPUExample · GitHub

Nice! Thanks for the link. How would I set up the shared memory in Swift?

You set the grid, the groups and dispatch the threads as you’ve seen in ch 16 and then simply declare and use shared memory as in that example I linked to.

It's working fine, thanks a lot for your help. Had to read a bit about parallel reduction :)

The only problem I have is that I have to limit the threads per threadgroup to (1,1), otherwise I get an error on my iPad:

Execution of the command buffer was aborted due to an error during execution. Caused GPU Hang Error (IOAF code 3)
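For readers who have not met parallel reduction before, here is a minimal CPU-side sketch in plain Swift of the idea as it applies to the 800x600 use case: each threadgroup reduces its own slice to one maximum in a shared scratch array, and the per-group results are compared afterwards. This is not the poster's kernel; `reduceGroupMax` and `groupedMax` are hypothetical names, and the inner loop stands in for work that GPU threads would do in parallel with a threadgroup barrier between steps.

```swift
/// Tree reduction over one group's values.
/// `shared` stands in for threadgroup memory in a real kernel.
func reduceGroupMax(_ values: [Float]) -> Float {
    precondition(!values.isEmpty)
    // Pad to a power of two so the halving loop stays simple.
    var n = 1
    while n < values.count { n *= 2 }
    var shared = values + [Float](repeating: -Float.infinity, count: n - values.count)
    var stride = n / 2
    while stride > 0 {
        // On the GPU, every thread with tid < stride runs this step at once,
        // followed by a barrier before the next iteration.
        for tid in 0..<stride {
            shared[tid] = max(shared[tid], shared[tid + stride])
        }
        stride /= 2
    }
    return shared[0]
}

/// Splits `data` into groups of `groupSize`, reduces each group,
/// then compares the partial results in a final pass.
func groupedMax(_ data: [Float], groupSize: Int) -> (partials: [Float], maximum: Float) {
    precondition(groupSize > 0 && !data.isEmpty)
    var partials: [Float] = []
    var start = 0
    while start < data.count {
        let end = min(start + groupSize, data.count)
        partials.append(reduceGroupMax(Array(data[start..<end])))
        start = end
    }
    return (partials, partials.max()!)
}

let result = groupedMax([0.5, 2.0, -1.0, 3.5, 1.25], groupSize: 2)
print(result.partials, result.maximum)
```

In a real kernel each group would write its partial maximum (and, if needed, the position where it occurred) to the output buffer at the group's index, and the final comparison could run either on the CPU or in a second, much smaller dispatch.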

On the desktop it always works fine.

Can I see how you dispatch the threads (group size and grid size) and also the kernel signature (just the first line)?

Below is the function where I calculate the sizes. Right now I limit the threads for my parallel compute shader via limitThreads. Thanks.

    func calculateThreadGroups(_ state: MTLComputePipelineState, _ encoder: MTLComputeCommandEncoder, _ width: Int, _ height: Int, store: Bool = false, limitThreads: Bool = false) {
        let w = limitThreads ? 1 : state.threadExecutionWidth
        let h = limitThreads ? 1 : state.maxTotalThreadsPerThreadgroup / w
        let threadsPerThreadgroup = MTLSizeMake(w, h, 1)

        let threadsPerGrid = MTLSize(width: width, height: height, depth: 1)
        encoder.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)

        let threadgroupsPerGrid = MTLSize(width: (width + w - 1) / w, height: (height + h - 1) / h, depth: 1)
        encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    }

The signature is

    kernel void diskBuilder(constant DISK_BUILDER_DATA *diskBuilderData [[ buffer(1) ]],
                            device float *out [[ buffer(0) ]],
                            uint2 id [[ thread_position_in_grid ]],
                            uint tid [[ thread_index_in_threadgroup ]],
                            uint2 bid [[ threadgroup_position_in_grid ]],
                            uint2 blockDim [[ threads_per_threadgroup ]])


A few remarks:

  • why are you dispatching twice?
  • do you call endEncoding() anywhere?
  • try with a 2D uint2 tid [[ thread_index_in_threadgroup ]] as well, see what happens.

  • why are you dispatching twice?

I thought I had to call dispatchThreads to set the threads per grid. Is that not needed when I call dispatchThreadgroups?

  • try with a 2D uint2 tid [[ thread_index_in_threadgroup ]] as well, see what happens.

program_source:147:54: error: type ‘uint2’ (vector of 2 ‘unsigned int’ values) is not valid for attribute ‘thread_index_in_threadgroup’
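The compiler error reflects that thread_index_in_threadgroup is a scalar attribute; the 2D counterpart is thread_position_in_threadgroup. As far as I know, the scalar index is the row-major flattening of that position within the group. A small Swift sketch of that relation for the 2D case, with `linearThreadIndex` being a hypothetical helper name:

```swift
/// Linear index of a thread inside its threadgroup, derived from its 2D position.
/// `groupWidth` corresponds to threads_per_threadgroup.x in the kernel.
func linearThreadIndex(x: Int, y: Int, groupWidth: Int) -> Int {
    return y * groupWidth + x
}

// Thread at column 3, row 2 of a group that is 32 threads wide.
print(linearThreadIndex(x: 3, y: 2, groupWidth: 32))
```

Either form can be used to index shared memory, as long as the shared array is sized for the whole group (width times height).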

By the way, a general question. My shaders are all procedural, i.e. I combine source code snippets which I adjust at runtime and then compile on the fly. The disadvantage of doing that is that I don’t have Metal syntax highlighting, and using the """ operator is also a pain because Xcode adjusts the indentation depth all the time.

I cannot put the snippets into .metal files as they get compiled and they are not complete shaders. Is there a different way to do this?

Thanks again for all the help.

If you dispatch twice, you do the work twice and only the last result is saved. I would keep only the first dispatch.
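The two calls in the function above are alternative ways to launch the same grid: dispatchThreads takes the total number of threads, while dispatchThreadgroups takes the number of groups, which is why the second call uses a ceiling division. The sketch below reproduces that arithmetic in plain Swift so it runs without Metal. `GridSize` is a hypothetical stand-in for MTLSize, and the 32x16 group is only an assumed example (a threadExecutionWidth of 32 with 512 threads per group), not a value reported in this thread.

```swift
/// Hypothetical stand-in for MTLSize so the sketch runs without Metal.
struct GridSize: Equatable {
    var width: Int
    var height: Int
}

/// Number of threadgroups needed to cover the grid (ceiling division),
/// matching the threadgroupsPerGrid calculation in the function above.
func threadgroupsPerGrid(grid: GridSize, group: GridSize) -> GridSize {
    precondition(group.width > 0 && group.height > 0)
    return GridSize(width: (grid.width + group.width - 1) / group.width,
                    height: (grid.height + group.height - 1) / group.height)
}

/// Threads launched beyond the grid edge when the group count is rounded up.
func overhangThreads(grid: GridSize, group: GridSize) -> Int {
    let groups = threadgroupsPerGrid(grid: grid, group: group)
    let launched = groups.width * group.width * groups.height * group.height
    return launched - grid.width * grid.height
}

let grid = GridSize(width: 800, height: 600)
let group = GridSize(width: 32, height: 16)   // assumed example size
print(threadgroupsPerGrid(grid: grid, group: group), overhangThreads(grid: grid, group: group))
```

When the group count is rounded up like this, some threads fall outside the 800x600 area, so a kernel launched with dispatchThreadgroups needs to guard its reads and writes against the grid bounds.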

I do not know of a different way to do it. I always put GPU code in .metal files.

Ok, thanks. Will do that regarding the dispatching (it does not change the iPad error though). I will just keep the limiter in place for now.

What’s the A chip generation in your iPad?

It's a new iPad Air, so A12 Bionic, but when you search for the error message you find lots of comments from people who have the same error on iOS 12 and think it's a bug on Apple's side.