Hello Caroline (as it’s likely you who will answer me),
I’m working on Chapter 16, trying to understand it better. Indeed, I’ve successfully loaded an array of Floats, carried out a calculation on them and compared the result with a CPU map operation. I loaded my buffer with a [Float] of 1024 elements and used:
let threadsPerGroup = MTLSize(width: pipelineState.threadExecutionWidth, // always 32
                              height: 1,
                              depth: 1)
let threadsPerGrid = MTLSize(width: 1024, height: 1, depth: 1)
computeEncoder.dispatchThreads(threadsPerGrid,
                               threadsPerThreadgroup: threadsPerGroup)
The CPU wins the 1,024-element array speed test (~10 times faster) but the GPU is ~25 times faster when my array holds 1,000,000 elements. Clearly there is some set-up cost that needs to be covered before it is worth using the GPU. So far so good!!
I don’t really understand, however, how much work is going on in parallel on my Apple M2 Max. Is it 32 values at a time? Surely more?
And what does a 2D input look like? I thought 2D would just be a series of 1D inputs, and since every cell is independent I cannot see why the GPU would even be interested in the shape of the array. I tried changing this line:
let threadsPerGrid = MTLSize(width: 64, height: 16, depth: 1)
but it crashes (with any height > 1) while this line:
let threadsPerGrid = MTLSize(width: 64, height: 1, depth: 1)
works but only processes the first 64 elements of the array.
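To make my mental model concrete, here is a minimal plain-Swift sketch (no Metal involved) of what I mean by 2D being just a series of 1D rows. The names flatIndex, rows and flat are mine, purely for illustration, and I’m assuming a row-major layout with 16 rows of 64 values:

```swift
// Hypothetical illustration: a 16 x 64 grid stored as one flat, row-major array.
let gridWidth = 64
let gridHeight = 16

// Map a 2D position (x, y) to its index in the flat array.
func flatIndex(x: Int, y: Int, width: Int) -> Int {
    return y * width + x
}

// Build 16 rows of 64 Floats, then flatten them into a single [Float].
let rows: [[Float]] = (0..<gridHeight).map { y in
    (0..<gridWidth).map { x in Float(y * gridWidth + x) }
}
let flat: [Float] = rows.flatMap { $0 }

print(flat.count)                                // 1024
print(flatIndex(x: 63, y: 0, width: gridWidth))  // 63 (last cell of row 0)
print(flatIndex(x: 0, y: 1, width: gridWidth))   // 64 (first cell of row 1)

// The same value is reachable through either view of the data.
print(rows[3][5] == flat[flatIndex(x: 5, y: 3, width: gridWidth)])  // true
```

If that picture is right, a 64 x 16 grid and a 1024 x 1 grid touch exactly the same 1,024 values, which is why the shape puzzles me.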
I tried loading a [[Float]] array of 16 [Float] rows, each of length 64, and imitating your code as follows (2 thread groups wide by 2 high, I think):
let width = 32
let height = 8
let threadsPerThreadgroup = MTLSize(width: width, height: height, depth: 1)
let gridWidth = 64
let gridHeight = 16
let threadGroupCount = MTLSize(width: (gridWidth + width - 1) / width,
                               height: (gridHeight + height - 1) / height,
                               depth: 1)
computeEncoder.dispatchThreadgroups(threadGroupCount,
                                    threadsPerThreadgroup: threadsPerThreadgroup)
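Here is the arithmetic behind my “2 thread groups wide by 2 high” guess, as a plain-Swift sketch that runs without Metal. groupCount is a hypothetical helper of mine; it only repeats the rounding-up division used in the code above:

```swift
// Hypothetical helper: number of threadgroups needed to cover `gridSize`
// threads along one axis when each group spans `groupSize` threads.
// Integer division rounds down, so adding (groupSize - 1) first rounds up.
func groupCount(gridSize: Int, groupSize: Int) -> Int {
    return (gridSize + groupSize - 1) / groupSize
}

let groupsWide = groupCount(gridSize: 64, groupSize: 32)
let groupsHigh = groupCount(gridSize: 16, groupSize: 8)
let threadsPerGroup = 32 * 8
let totalThreads = groupsWide * groupsHigh * threadsPerGroup

print(groupsWide, groupsHigh)  // 2 2
print(threadsPerGroup)         // 256
print(totalThreads)            // 1024

// A size that is not an exact multiple still gets fully covered:
print(groupCount(gridSize: 65, groupSize: 32))  // 3
```

So, if I have this right, the dispatch asks for 1,024 threads in total, the same number as my 1D version.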
It crashed with:
validateBuiltinArguments:755: failed assertion `component 1: 16 must be <= 1 for id [[ thread_position_in_grid ]]'
So I really do not have a clear idea of what is going on. I can’t even tell whether using multiple dimensions has any advantage in terms of how much work is done in parallel.
Any help will be appreciated.