Best data layout in threadgroup memory

Is threadgroup memory in Metal similar to shared memory in CUDA? Is there a similar bank-conflicts issue in Metal? I'm trying to optimize memory access in threadgroup memory.


I don't know much about CUDA bank conflicts, but the memory hierarchy is similar in Metal: you have global (device/constant), shared (threadgroup), and local (thread) memory. You can read more in the MSL specification on pages 56 and 126. I hope this helps.

It's about the data layout inside threadgroup memory, to reduce the number of load/store instructions when accessing elements of an array.

I see… did you find the doc useful?

No, the doc doesn't say anything about that. But I asked an Apple GPU engineer, and here's the answer:

consecutive accesses should always be good. i.e. each thread k should access element at k. The accesses should be coalesced between neighboring threads.
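To illustrate what that advice looks like in practice, here's a minimal MSL sketch (my own illustrative code, not the engineer's; the kernel and buffer names are made up). Each thread k touches element k of the threadgroup array, so neighboring threads in a SIMD-group access consecutive elements:

```metal
#include <metal_stdlib>
using namespace metal;

// Sketch: stride-1 access into threadgroup memory.
kernel void coalesced_example(device const float *input   [[buffer(0)]],
                              device float       *output  [[buffer(1)]],
                              threadgroup float  *scratch [[threadgroup(0)]],
                              uint tid [[thread_position_in_threadgroup]],
                              uint gid [[thread_position_in_grid]])
{
    // Good: thread k accesses element k, so accesses coalesce
    // across neighboring threads.
    scratch[tid] = input[gid];

    threadgroup_barrier(mem_flags::mem_threadgroup);

    // By contrast, a strided pattern such as scratch[tid * 32]
    // would scatter neighboring threads across the array.
    output[gid] = scratch[tid] * 2.0f;
}
```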

That's spatial locality, a basic concept in parallel programming, but it doesn't answer your question about the similarity you see with CUDA bank conflicts. I'd recommend reading one of the many books on parallel programming: the concepts aren't hardware specific, so what you learn there applies across all APIs.

The Apple GPU engineer didn't mention CUDA bank conflicts. Here's the quote:

The answer is complicated for reasons I can’t go into, but a good rule of thumb is to have threads in a warp access consecutive elements, ideally 16B per element and no smaller than 4B.

Then I asked him about 2B elements, and he said, “consecutive accesses should always be good.”
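For completeness, here's how the "16B per element" rule of thumb might look in MSL (again, an illustrative sketch of my own, under the assumption that packing data into float4 is an acceptable layout for your problem). Each thread moves one 16B element per access instead of four 4B ones:

```metal
#include <metal_stdlib>
using namespace metal;

// Sketch: float4 elements are 16B each, matching the suggested
// sweet spot; 2B elements (e.g. half) would fall below the
// recommended 4B minimum.
kernel void wide_element_example(device const float4 *input  [[buffer(0)]],
                                 device float4       *output [[buffer(1)]],
                                 threadgroup float4  *tile   [[threadgroup(0)]],
                                 uint tid [[thread_position_in_threadgroup]],
                                 uint gid [[thread_position_in_grid]])
{
    tile[tid] = input[gid];          // one 16B load per thread
    threadgroup_barrier(mem_flags::mem_threadgroup);
    output[gid] = tile[tid] + 1.0f;  // one 16B store per thread
}
```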