Chapter 16: Handling 2D and 3D inputs

Hello Caroline (as it’s likely you who will answer me :smile:)

I’m working on Chapter 16, trying to understand it better. Indeed, I’ve successfully loaded an array of Floats, carried out a calculation on them and compared the result with a CPU map operation. I loaded my buffer with a [Float] of 1024 elements and used:

  let threadsPerGroup = MTLSize(width: pipelineState.threadExecutionWidth,  // always 32
                                height: 1,
                                depth: 1)
  let threadsPerGrid = MTLSize(width: 1024, height: 1, depth: 1)
  computeEncoder.dispatchThreads(threadsPerGrid,
                                 threadsPerThreadgroup: threadsPerGroup)

The CPU wins the 1,024-element array speed test (~10 times faster) but the GPU is ~25 times faster when my array holds 1,000,000 elements. Clearly there is some set-up cost that needs to be covered before it is worth using the GPU. So far so good!!

I don’t really understand, however, how much work is going on in parallel on my Apple M2 Max. Is it 32 values at a time? Surely more?
And what does a 2D input look like? I thought 2D would just be a series of 1D inputs, and since every cell is independent, I can’t see why the GPU would even be interested in the shape of the array. I tried changing this line:

  let threadsPerGrid = MTLSize(width: 64, height: 16, depth: 1)

but it crashes (with any height > 1) while this line:

  let threadsPerGrid = MTLSize(width: 64, height: 1, depth: 1)

works but only processes the first 64 elements of the array.

I tried loading a [[Float]] array of 16 [Float]s of length 64 and imitating your code as follows (2 threadgroups wide by 2 high, I think):

  let width = 32
  let height = 8
  let threadsPerThreadgroup = MTLSize(width: width, height: height, depth: 1)
  let gridWidth = 64
  let gridHeight = 16
  let threadGroupCount = MTLSize(width: (gridWidth + width - 1) / width,
                                 height: (gridHeight + height - 1) / height,
                                 depth: 1)
  computeEncoder.dispatchThreadgroups(threadGroupCount,
                                      threadsPerThreadgroup: threadsPerThreadgroup)

It crashed with:
validateBuiltinArguments:755: failed assertion `component 1: 16 must be <= 1 for id [[ thread_position_in_grid ]]’

So I really do not have a clear idea of what is going on. I can’t even imagine if there are advantages to using multi-dimensions in terms of how much work is done in parallel.

Any help will be appreciated.

Yes, it’s me :slight_smile:

Correct. It’s not always better to use the GPU. Even if you’re parallel processing, sometimes it’s better to synchronise threads on the CPU, due to the overhead of setting up.

A threadgroup is made up of threads. pipelineState.maxTotalThreadsPerThreadgroup gives you this number. On my Apple GPU devices it’s 1024.

The pipeline state object’s threadExecutionWidth is the number of threads that the GPU executes simultaneously on a single compute unit. For all my devices, the thread execution width is 32. If you’re googling this, Nvidia calls it warp size and AMD calls it wavefront.
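
For example, here is a minimal sketch that reads both of those values off the pipeline state (using the pipelineState and computeEncoder names from your snippet) instead of hard-coding 32:

  let simdWidth = pipelineState.threadExecutionWidth             // 32 on my devices
  let maxPerGroup = pipelineState.maxTotalThreadsPerThreadgroup  // 1024 on my devices
  // Largest threadgroup width that is still a multiple of the SIMD width.
  let groupWidth = (maxPerGroup / simdWidth) * simdWidth
  let threadsPerGroup = MTLSize(width: groupWidth, height: 1, depth: 1)
  let threadsPerGrid = MTLSize(width: 1_000_000, height: 1, depth: 1)
  computeEncoder.dispatchThreads(threadsPerGrid,
                                 threadsPerThreadgroup: threadsPerGroup)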

From Wikipedia on Apple M1:

The M1 integrates an Apple designed eight-core (seven in some base models) graphics processing unit (GPU). Each GPU core is split into 16 Execution Units, which each contain eight Arithmetic Logic Units (ALUs). In total, the M1 GPU contains up to 128 Execution units or 1024 ALUs, which Apple says can execute up to 24,576 threads simultaneously.

However, Apple doesn’t publish information about how its GPU works, so :woman_shrugging:.

But the GPU is running lots of groups of 32 threads simultaneously.

As for the dimensions, they should reflect what your data is. For a buffer, you’d use one dimension, and for a 2d texture you’d use two dimensions. You then use a correctly dimensioned id in the kernel function to access the thread.
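
To make that concrete, here is a minimal MSL sketch (the kernel and argument names are made up) of what “correctly dimensioned” means. The assertion you hit comes from declaring a scalar uint id in the kernel and then dispatching a grid with a height greater than 1:

  #include <metal_stdlib>
  using namespace metal;

  // 1D grid: the id is a scalar, so the dispatched grid must have height == depth == 1.
  kernel void scale_1d(device float *data [[buffer(0)]],
                       uint id [[thread_position_in_grid]]) {
    data[id] *= 2.0f;
  }

  // 2D grid: the id is a uint2, so the grid can have both a width and a height.
  kernel void scale_2d(texture2d<float, access::read_write> image [[texture(0)]],
                       uint2 id [[thread_position_in_grid]]) {
    float4 color = image.read(id);
    image.write(color * 2.0f, id);
  }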

If you really do have a 2d array that you want to put into a metal buffer, you should convert it into a 1d array. This SO answer has some code to illustrate: objective c - How do I set the contents of a Metal Buffer as a 2D array? - Stack Overflow
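
Roughly like this, staying with your 16 × 64 example (device here is assumed to be your MTLDevice); inside the kernel you then index with row * width + column:

  let grid: [[Float]] = Array(repeating: Array(repeating: 1, count: 64), count: 16)
  let flat: [Float] = grid.flatMap { $0 }   // 1024 elements, laid out row by row
  let buffer = device.makeBuffer(bytes: flat,
                                 length: MemoryLayout<Float>.stride * flat.count,
                                 options: [])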

Further things:

Apple GPUs are optimised for 16-bit data types. For ease of learning, we used float on the GPU, but we should be using half wherever possible. And short over int. As the number of threads would fit in a 16-bit type, the kernel function convert_mesh should be using ushort id.

32 threads take the same action in parallel. If you have conditionals within those 32 threads, and some threads need to do something different, the branches are serialised: each path executes in turn while the threads not on that path sit idle.

There is also a “working set limit” for a single command encoder. This is the total amount of memory that can be allocated on the GPU at any one time. You can discover this through device.recommendedMaxWorkingSetSize. On my iPhone 12 it’s about 4GB, and on my M1 32GB MacBook Pro, it’s about 22GB.
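
All of these limits are just queries (assuming device and pipelineState are your MTLDevice and compute pipeline state):

  print(device.recommendedMaxWorkingSetSize)          // in bytes; ~22 GB on my M1 MacBook Pro
  print(pipelineState.maxTotalThreadsPerThreadgroup)  // 1024 on my devices
  print(pipelineState.threadExecutionWidth)           // 32 on my devices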


Hi Caroline :smile:
You’ve certainly guessed by now that I’m not a computer professional, just an old hacker playing around with something he barely understands. All the more difficult, then, to explain why I bought a (38-core) Apple M2 Max since we last discussed bugs with my 2015 machine. It’s way above my pay grade :smile:.

Anyhow, the relevant Wiki entry says that the M2 Max GPU contains up to 608 execution units or 4864 ALUs so I’m guessing that means maybe 100k threads simultaneously. Wow!

I’m not sure what you mean there!?

I had looked at pipelineState.maxTotalThreadsPerThreadgroup before I posted, imagining that this greater number (1024 on my machine too, thread execution width also 32) would be better but it changed nothing.

I have the same working set limits on my devices too.

I changed from Float to Float16 with some improvement but could not change uint to ushort because my indices go to 1 million.

Just why I need to tell my GPU to use 32 is not clear to me if it’s handling tens of thousands of threads simultaneously anyway.

Sorry, I have no idea what that means. I thought all 1 million threads take the same action in parallel.

Finally, I’m sure you’ve better things to do than answer my silly questions. My hopes are that (i) some other readers benefit and (ii) it helps you gauge what (at least one) reader(s) want to find out more about. Many thanks for your patience.

We all have our strengths :slight_smile: - mine are more towards rendering pretty pictures than crunching data through compute parallel programming.

I just meant breaking your problem down into pieces and performing them concurrently on CPU threads:

https://docs.swift.org/swift-book/documentation/the-swift-programming-language/concurrency/
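
As a rough illustration of that idea (this sketch uses Dispatch’s concurrentPerform rather than the structured concurrency in the link, and the data array is made up):

  import Dispatch

  let data = [Float](repeating: 1, count: 1_000_000)
  var results = [Float](repeating: 0, count: data.count)
  let chunks = 8
  let chunkSize = (data.count + chunks - 1) / chunks

  results.withUnsafeMutableBufferPointer { output in
    DispatchQueue.concurrentPerform(iterations: chunks) { chunk in
      let start = chunk * chunkSize
      let end = min(start + chunkSize, data.count)
      for i in start..<end {
        output[i] = data[i] * 2   // each chunk runs on its own CPU thread
      }
    }
  }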

I think I’m going to have to come up with something that explains it all better. That won’t happen in the short term though. In the meantime, perhaps this article might help? It’s different hardware and different terminology, but the general concepts are the same.

Scroll down to the colour section where it has pictures of warps.

You dispatch your grid of threadgroups to the GPU. Each threadgroup is made up of SIMD-groups of threads. threadExecutionWidth gives 32, which is the SIMD-group width. In the article, this SIMD width is the warp.

All threads are performed simultaneously within the SIMD group. The article linked above shows what happens when there is divergence within a SIMD group (warp).

There are lots of execution units with multiple SIMD groups doing things approximately at the same time. On the Apple GPU, threadgroups are distributed across GPU cores.


If you are doing a lot of compute work, you should also add a memory guard to your compute kernel if your data is not an exact fit for the grid size.

The main reason for this is that the GPU takes the buffer’s start pointer as the place to begin work from. It then uses the element stride to hand each entry to a core for computation, and the number of indices processed is equal to the thread count. If the two don’t match exactly, you can end up performing computation on memory that is outside the buffer. With a guard in place, you tell the GPU to ignore the threads whose index goes past the array count.

The memory guard stops the GPU from performing calculations outside the bounds of your buffer.

The way this works, say you are adding 1 to a list of integers:

  • The list contains 100 entries.
  • You pass the array count into the compute kernel and check that the index id in the kernel never goes past that count; if it does, return early.

This keeps your memory access safe, and it’s something to keep in mind with compute kernels.
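
Something like this, as a rough MSL sketch (the kernel and buffer names are made up; the count is passed in from the Swift side, for example with computeEncoder.setBytes, before dispatching):

  #include <metal_stdlib>
  using namespace metal;

  kernel void add_one(device int *values [[buffer(0)]],
                      constant uint &count [[buffer(1)]],
                      uint id [[thread_position_in_grid]]) {
    // Memory guard: threads beyond the end of the buffer do nothing.
    if (id >= count) {
      return;
    }
    values[id] += 1;
  }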

If I understand you correctly, you’re referring to paragraph 5.2.3.6 in the Metal Shading Language specification:

In Metal 2, the number of threads in the grid does not have to be a multiple of the number of threads in a threadgroup. It is therefore possible that the actual threadgroup size of a specific threadgroup may be smaller than the threadgroup size specified in the dispatch. The [[threads_per_threadgroup]] attribute specifies the actual threadgroup size for a given threadgroup executing the kernel. The [[dispatch_threads_per_threadgroup]] attribute is the threadgroup size specified at dispatch.

It’s all very technical, but I understand it to mean that if I submit 1,000 data points with threadExecutionWidth = 32, one of my groups will only have 8 in it and the GPU will still try to process 32. Is that correct? And if I use a guard, it will be tested on every data point. Would I be better off padding my data with 24 dummy points on the CPU and having the CPU ignore their output?

If it doesn’t impact your data, you can pad it, and that would work as you mentioned.

I just did a test processing 0.5 million vertices, with the memory guard in the shader removed and with it added, and there seemed to be nearly no difference in performance.

So depending on how much extra effort padding your buffers is, there may be a trade-off where it’s simpler to just add the guard.
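
For comparison, the padding route is just rounding the count up to the next multiple of the threadgroup width and ignoring the tail on the CPU, roughly like this (using the 1,000-element example from above):

  let count = 1_000
  let groupWidth = pipelineState.threadExecutionWidth                     // 32
  let paddedCount = ((count + groupWidth - 1) / groupWidth) * groupWidth  // 1024
  // Allocate and dispatch paddedCount elements, then ignore the last
  // paddedCount - count results when reading the buffer back.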

Hope that helps