Cooperative Compute Shader

markusm · May 1, 2019, 8:07am

Hi

I am looking for an example of how to create a compute shader so that the results of each thread group can be analyzed and compared after execution.

I am unsure how to set this up. In my use case I need to iterate over a memory area of 800x600 created by procedural 2D SDFs to find the point with the largest distance into the shape (no textures used). How do I split this up over several thread groups? I have it running on one thread but it’s way too slow.

Thanks

mhorga · May 1, 2019, 10:36am

You can access shared memory with the threadgroup qualifier. Check out this sample code: 2D array for sum optimization · Issue #1 · mateuszbuda/GPUExample · GitHub

markusm · May 1, 2019, 11:15am

Nice! Thanks for the link. How would I set up the shared memory in Swift?

mhorga · May 1, 2019, 1:40pm

You set the grid, the groups and dispatch the threads as you’ve seen in ch 16 and then simply declare and use shared memory as in that example I linked to.
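If the shared array is declared with a fixed size inside the kernel, as in the linked example, nothing extra is needed in Swift. When the size should be chosen at encode time instead, one option is `setThreadgroupMemoryLength(_:index:)` on the compute encoder. A minimal sketch, assuming the kernel takes a parameter such as `threadgroup float *shared [[ threadgroup(0) ]]` and stores one float per thread; the helper names `threadgroupMemoryLength` and `encodeWithSharedMemory` are made up for illustration:

```swift
import Metal

/// Bytes needed for one Float per thread, rounded up to a multiple of 16,
/// because the length passed to setThreadgroupMemoryLength must be a multiple of 16 bytes.
func threadgroupMemoryLength(threadCount: Int) -> Int {
    let raw = threadCount * MemoryLayout<Float>.stride
    return (raw + 15) / 16 * 16
}

/// Hypothetical helper: binds the pipeline and reserves shared memory at threadgroup index 0.
/// Assumes the kernel declares `threadgroup float *shared [[ threadgroup(0) ]]`.
func encodeWithSharedMemory(_ state: MTLComputePipelineState,
                            _ encoder: MTLComputeCommandEncoder) {
    let w = state.threadExecutionWidth
    let h = state.maxTotalThreadsPerThreadgroup / w
    encoder.setComputePipelineState(state)
    encoder.setThreadgroupMemoryLength(threadgroupMemoryLength(threadCount: w * h), index: 0)
}
```

The grid and threadgroup sizes are then dispatched as usual; the shared memory is per threadgroup, so its size depends only on the threads per threadgroup, not on the grid.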

markusm · May 2, 2019, 6:45am

It’s working fine, thanks a lot for your help. I had to read a bit about parallel reduction.
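For readers unfamiliar with the term, here is a CPU-side sketch in Swift of the tree-style reduction for a maximum. It is only an illustration of the idea, not the shader from this thread: on the GPU each `tid` in the inner loop would be a separate thread in the threadgroup, `shared` would live in threadgroup memory, and a barrier would separate the passes. The function name `reduceMax` is made up.

```swift
/// Tree reduction for the maximum, mirroring what one threadgroup does in shared memory.
func reduceMax(_ values: [Float]) -> Float {
    precondition(!values.isEmpty, "need at least one value")

    // Pad to the next power of two with -infinity so every pass halves cleanly.
    var n = 1
    while n < values.count { n *= 2 }
    var shared = values + [Float](repeating: -.infinity, count: n - values.count)

    // Each pass compares element tid with element tid + stride, then halves the stride.
    var stride = n / 2
    while stride > 0 {
        for tid in 0..<stride {
            shared[tid] = max(shared[tid], shared[tid + stride])
        }
        stride /= 2
    }
    // After the last pass the maximum sits in slot 0.
    return shared[0]
}
```

With one such result written per threadgroup, the per-group maxima can be compared in a second pass or on the CPU.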

markusm · May 3, 2019, 12:48am

The only problem is that I have to limit the threads per threadgroup to (1,1), otherwise I get this error on my iPad:

Execution of the command buffer was aborted due to an error during execution. Caused GPU Hang Error (IOAF code 3)

On the desktop it always works fine.

mhorga · May 3, 2019, 2:10am

Can I see how you dispatch the threads (group size and grid size) and also the kernel signature (just the first line)?

markusm · May 3, 2019, 2:15am

Below is the function where I calculate the sizes. Right now I limit the threads for my parallel compute shader via limitThreads. Thanks.

func calculateThreadGroups(_ state: MTLComputePipelineState, _ encoder: MTLComputeCommandEncoder,_ width: Int,_ height: Int, store: Bool = false, limitThreads: Bool = false)
{
    let w = limitThreads ? 1 : state.threadExecutionWidth
    let h = limitThreads ? 1 : state.maxTotalThreadsPerThreadgroup / w
    let threadsPerThreadgroup = MTLSizeMake(w, h, 1)

    let threadsPerGrid = MTLSize(width: width, height: height, depth: 1)
    
    encoder.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)

    let threadgroupsPerGrid = MTLSize(width: (width + w - 1) / w, height: (height + h - 1) / h, depth: 1)
            
    encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    
}

markusm · May 3, 2019, 2:18am

The signature is

    kernel void diskBuilder(constant DISK_BUILDER_DATA *diskBuilderData [[ buffer(1) ]],
                                    device float *out [[ buffer(0) ]],
                                            uint2 id [[ thread_position_in_grid ]],
                                            uint tid [[ thread_index_in_threadgroup ]],
                                            uint2 bid [[ threadgroup_position_in_grid ]],
                                            uint2 blockDim [[ threads_per_threadgroup ]])

    {

mhorga · May 3, 2019, 2:39am

A few remarks:

why are you dispatching twice?
do you call endEncoding() anywhere?
try with a 2D uint2 tid [[ thread_index_in_threadgroup ]] as well, see what happens.

markusm · May 3, 2019, 3:07am

why are you dispatching twice?

I thought I had to call dispatchThreads to set the threads per grid. Is that not needed when I call dispatchThreadgroups?

try with a 2D uint2 tid [[ thread_index_in_threadgroup ]] as well, see what happens.

program_source:147:54: error: type ‘uint2’ (vector of 2 ‘unsigned int’ values) is not valid for attribute ‘thread_index_in_threadgroup’

By the way, a general question: my shaders are all procedural, i.e. I combine source code snippets, which I adjust at runtime and then compile on the fly. The disadvantage of doing that is that I don’t have Metal syntax highlighting, and using the """ multi-line string delimiter is also a pain because Xcode keeps changing the indentation depth.

I cannot put the snippets into .metal files, as those get compiled and the snippets are not complete shaders. Is there a different way to do this?

Thanks again for all the help.

mhorga · May 3, 2019, 3:21am

If you dispatch twice, you do the work twice and only the last result is saved. I would keep only the first dispatch.

I do not know of a different way to do it. I always put GPU code in .metal files.
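A sketch of a single-dispatch version of the function posted above. `dispatchThreads` takes the total number of threads and relies on the GPU supporting non-uniform threadgroup sizes; `dispatchThreadgroups` takes the number of groups, so the count is rounded up and the kernel is assumed to check its position against the grid size. The names `threadgroupCount`, `dispatchOnce` and the `supportsNonUniform` flag are hypothetical; how that flag is determined is left to the caller.

```swift
import Metal

/// Number of threadgroups needed to cover width x height, rounding up.
func threadgroupCount(width: Int, height: Int, w: Int, h: Int) -> (x: Int, y: Int) {
    return ((width + w - 1) / w, (height + h - 1) / h)
}

/// Encodes exactly one dispatch for a width x height grid.
func dispatchOnce(_ state: MTLComputePipelineState,
                  _ encoder: MTLComputeCommandEncoder,
                  width: Int, height: Int,
                  supportsNonUniform: Bool) {
    let w = state.threadExecutionWidth
    let h = state.maxTotalThreadsPerThreadgroup / w
    let threadsPerThreadgroup = MTLSize(width: w, height: h, depth: 1)

    if supportsNonUniform {
        // Total thread count; Metal trims the edge threadgroups itself.
        let threadsPerGrid = MTLSize(width: width, height: height, depth: 1)
        encoder.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    } else {
        // Group count, rounded up; threads past the edge must be ignored in the kernel.
        let groups = threadgroupCount(width: width, height: height, w: w, h: h)
        let threadgroupsPerGrid = MTLSize(width: groups.x, height: groups.y, depth: 1)
        encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    }
}
```

For the 800x600 area with a 32x32 threadgroup, the rounded-up count is 25x19 groups, which covers 800x608 threads, so the last row of groups contains threads outside the area.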

markusm · May 3, 2019, 3:25am

Ok, thanks. Will do that re dispatching (does not change the iPad error though). Will just keep the limiter in place right now.

mhorga · May 3, 2019, 3:37am

What’s the A chip generation in your iPad?

markusm · May 3, 2019, 3:41am

It’s a new iPad Air, so A12 Bionic. But when you search for the error message, you find lots of comments from people who have the same error on iOS 12 and think it’s a bug on Apple’s side.