Chapter 16: atomic_int

tchelyzt · December 6, 2023, 2:46am

Hi Caroline,
I’ve built a program which tries to use a kernel function and used chapter 16 to implement an atomic-int type. I followed your model carefully except that my version uses an array of atomic_ints. Here are the bones with most of the detail removed:

#include <metal_stdlib>
  using namespace metal;

  struct Thing {
    constant int *values;
  };

 kernel void simplified(constant Thing *things [[buffer(0)]],
                       device atomic_int *scores [[buffer(1)]],
                       uint thingIdx [[thread_position_in_grid]]) {

   int score = 0;

   for (int i = 0; i < 200 ; i++) {
      // things not used here but I believe this func will be called once for each Thing and loop 200 times
      score += 1;
      atomic_fetch_add_explicit(&scores[thingIdx], score, memory_order_relaxed);
   }
}

In summary, 256 (much more elaborate than shown) Things are passed in with an array of scores (Ints) and I want to update the scores table based on the Things. I’ve stripped it down here to ignore the Things and simply accumulate scores in a loop.

The thing is, when I return to the CPU, the scores remain at 0.

Here’s some of what the CPU does:

init() {
  ...
  thingBuffer = Model.Device.makeBuffer(length: MemoryLayout<Thing>.stride * 256)!
  var thingPointer = thingBuffer.contents().bindMemory(to: Thing.self, capacity: 256)
  
  for t in 0..<256 {
     thingPointer.pointee = things[t]
     thingPointer = thingPointer.advanced(by: 1)
  }
  
  scoreBuffer = Model.Device.makeBuffer(length: MemoryLayout<Int>.stride * 256)!
  scorePointer = scoreBuffer.contents().bindMemory(to: Int.self, capacity: 256)
  
  var pointer = scorePointer
  for p in 0..<256 {
     pointer.pointee = scores[p]
     pointer = pointer.advanced(by: 1)
  }
}

 func compute() {
   guard let commandBuffer = Model.CommandQueue.makeCommandBuffer(),
   let computeEncoder = commandBuffer.makeComputeCommandEncoder()
   else { return }
  
    computeEncoder.setBuffer(thingBuffer, offset: 0, index: 0)
    computeEncoder.setBuffer(scoreBuffer, offset: 0, index: 1)
  
    computeEncoder.setComputePipelineState(computePipelineState)
  
    let threadsPerGrid = MTLSize(width: 256, height: 1, depth: 1)
    let threadsPerGroup = MTLSize(width: computePipelineState.maxTotalThreadsPerThreadgroup, height: 1, depth: 1)
    computeEncoder.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerGroup)
  
    computeEncoder.endEncoding()      
    commandBuffer.commit()

    var pointer = scorePointer
    // score pointer was saved and never advanced
    // This loop returns all zeros
    for p in 0..<256 {
      print(pointer.pointee, terminator: ",")
      scores[p] = pointer.pointee
      pointer = pointer.advanced(by: 1)
   }
}

Am I missing something very obvious?

regards
Don

tchelyzt · December 6, 2023, 9:37am

I’m wrong. This doesn’t even need an atomic variable, each score is in a different thread. Even so, it doesn’t work with either atomic_int or int. I wonder why.

I changed to a simple int to no avail. Then I added :

commandBuffer.waitUntilCompleted()

and I got this result I can’t explain (86328842669700) in 128 scores and 0 in the remainder. I did a debug capture and the buffers hold what I expect them to hold except that Ints are 64 bit and when displayed as int in the buffer they are split in 2 numbers. I wonder are my GPU ints the same as my CPU Ints?

I moved the last loop into a completed handler thus:

commandBuffer.addCompletedHandler { [self] _ in
  var pointer = scorePointer
  for p in 0..<256 {
     print(pointer.pointee, terminator: ",")
     scores[p] = pointer.pointee
     pointer = pointer.advanced(by: 1)
  }
}

To my surprise, it is only called if I also call waitUntilCompleted(). Surely it should always be called!

caroline · December 7, 2023, 1:51am

I’m a little confused. Before I sort this out, please confirm.

You have a struct:

struct Thing {
  var scores = [Int](repeating: 0, count: 200)
}

var things = [Thing](repeating: Thing(), count: 256)

That would make an array of 256 Things, and each Thing contains an array of 200 scores.

Do you want to add up the scores within a Thing so you end up with 256 totals, or do you want to add up the scores over all Things to one total?

tchelyzt · December 7, 2023, 2:34am

First, let me answer my own question because I think I cracked it.
An int in Metal is a 32 bit number
An Int in Swift is 64 bit (at least on my machine)

I rewrote the example from Apple’s “Performing Calculations on a GPU” article (Apple Developer Documentation), which actually uses Floats, and it worked. Then I changed it to Ints and it failed. When I changed it to Int32s, everything worked.

Alternatively, changing this line:

      let gridSize = MTLSize(width: bufferSize, height: 1, depth: 1)

to

  let gridSize = MTLSize(width: bufferSize * 2, height: 1, depth: 1)

also works. I find it curious that the width parameter is a count of array elements rather than a .size or .stride but anyhow the secret is to know the difference between Swift and Metal.
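
The odd value reported earlier (86328842669700 in 128 scores, 0 in the rest) can be reproduced on the CPU. Here is a minimal sketch in plain C++ rather than Metal, assuming each of the 256 threads adds 1 + 2 + ... + 200 = 20100 to its own 32-bit slot while the buffer was sized for 64-bit integers. The names simulateKernel and readAsInt64 are hypothetical and exist only for this illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Simulates `threadCount` GPU threads that each treat the buffer as an
// array of 32-bit ints and add 1 + 2 + ... + 200 to their own slot.
// The buffer itself was sized on the CPU as `elementCount` 64-bit integers.
std::vector<unsigned char> simulateKernel(std::size_t elementCount,
                                          std::size_t threadCount) {
    std::vector<unsigned char> bytes(elementCount * sizeof(std::int64_t), 0);
    assert(threadCount * sizeof(std::int32_t) <= bytes.size());
    for (std::size_t thread = 0; thread < threadCount; ++thread) {
        std::int32_t slot = 0;
        std::memcpy(&slot, &bytes[thread * sizeof slot], sizeof slot);
        std::int32_t score = 0;
        for (int i = 0; i < 200; ++i) {
            score += 1;
            slot += score;
        }
        std::memcpy(&bytes[thread * sizeof slot], &slot, sizeof slot);
    }
    return bytes;
}

// Reads element `index` the way the Swift side does: as a 64-bit integer.
std::int64_t readAsInt64(const std::vector<unsigned char>& bytes,
                         std::size_t index) {
    std::int64_t value = 0;
    std::memcpy(&value, &bytes[index * sizeof value], sizeof value);
    return value;
}
```

With 256 elements and 256 threads, elements 0 to 127 read back as 20100 * 2^32 + 20100 = 86328842669700 and elements 128 to 255 stay 0. With 512 threads every 64-bit element is touched, but in this simulation each one holds that same packed value rather than 20100, so matching the element types (Int32 in Swift, or long in Metal) looks like the safer fix for the scores kernel.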

As to your question, no, Thing does not contain scores, just other stuff I left out which were not needed to illustrate the problem. scores is a separate array provided in Buffer[1]. Perhaps I should have included the Metal function. Sorry. In the real version, Thing is used to calculate each score.

kernel void simplified(constant Thing *things [[buffer(0)]],
                       device int *scores [[buffer(1)]],
                       uint thingIdx [[thread_position_in_grid]]) {

   int score = 0;

   for (int i = 0; i < 200 ; i++) {
      // things not used here but this func will be called once for each Thing and loop 200 times
      score += 1;
      scores[thingIdx] += score;
   }
}

Anyhow, the problem has gone away, firstly because it’s not at all a problem about atomic variables (that was me panicking when my first tries failed) and secondly because all has become clear now that I realise the difference between Int and int.

At least I hope so. Now I have to feed that conclusion back into my real program

Thanks for looking at it.

caroline · December 7, 2023, 3:26am

That width parameter has to be a count, because you’re describing the number of threads, not the size of a buffer.

I’m quite relieved that I don’t have to work out what might have been a two-dimensional kernel (256 x 200 x 1).

tchelyzt · December 7, 2023, 3:48am

I’m still not out of the woods. Looks like the simplest way to change my program was to use the long type in Metal. Turns out my uniforms need to be long too (because I was too lazy to use a bridging header, see below)

It gets curiouser and curiouser:
If I make an enum:Int in swift, its rawValue has a stride of 8 as I’d expect
If I put it in the bridging header, its rawValue has a stride of 4. It seems to know that Metal will treat it as 32 bit.
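
The two strides can be mimicked in C++ by fixing the underlying type of an enum. This is only a sketch: the assumption is that Swift's Int is 64 bits on the machine in question and that the enum in the bridging header ends up with a 32-bit underlying type (C leaves that choice to the compiler). DirectionWide, DirectionNarrow and packedBytes are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for a Swift `enum: Int` raw value on a 64-bit platform.
enum class DirectionWide : std::int64_t { north, east, south, west };

// Stand-in for an enum whose raw value is a 32-bit integer.
enum class DirectionNarrow : std::int32_t { north, east, south, west };

// Bytes needed for `count` tightly packed values of type T.
template <typename T>
constexpr std::size_t packedBytes(std::size_t count) {
    return sizeof(T) * count;
}
```

A buffer of 256 wide values needs twice the bytes of 256 narrow ones, so whichever side declares the type decides how the other side must read it.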

caroline · December 7, 2023, 10:59am

The Metal data types are listed on page 22 of the Metal Shading Language specification

64-bit integer math is a recent addition according to the Metal Feature Set document. I haven’t tried it, but I guess it works.

tchelyzt · December 7, 2023, 11:47am

Yes, I’ve been in there. That’s when everything started to click.
I also looked in the functions for a random number generator like in C++ but it doesn’t appear to have one.

caroline · December 7, 2023, 11:25pm

No, Metal doesn’t have a built-in random number generator, so you can’t just call one on the GPU.

Chapter 27, “Rendering With Rays” has a noise function for 2d random numbers. Otherwise, this stackoverflow has some good suggestions:

tchelyzt · January 6, 2024, 9:08am

Hi Caroline,
With reference to the above, here’s a very basic random number generator that I’ve developed to run in my Metal project. It might come in handy for you.

  //
  //  RandomNumberGenerator.h
  //  Robby
  //
  //  Created by Don McBrien on 24/12/2023.
  //

  #ifndef RandomNumberGenerator_h
  #define RandomNumberGenerator_h

  class RNG {
  public:
     uint s;
     uint a = 16807;          // pow(7,5)
     uint m = 2147483647;     // pow(2,31)-1
     
     RNG(float seed) {
        // seed with a Float.random(in: 0..<1) delivered from the CPU
        s = uint(seed * float(m));
     }
     
     float next() {
        // return a new random number in float(0.0..<1.0)
        // a * s can exceed 32 bits, so widen to 64 bits before the modulo
        s = uint((ulong(a) * ulong(s)) % ulong(m));
        return float(s) / float(m);
     }
     
     uint next(uint upto) {
        // return a new random number in uint(0..<upto)
        return uint(next() * float(upto));
     }
  };

  #endif /*RandomNumberGenerator_h */
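
One detail worth checking with these constants (a = 16807, m = 2^31 - 1) is the width of the product a * s, which can reach roughly 2^45. The following CPU-side C++ sketch compares a step that keeps the product in 64 bits with one that lets it wrap at 32 bits. The function names are hypothetical, and in Metal the assumption is that ulong would play the role of std::uint64_t on targets that support 64-bit integers:

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kMultiplier = 16807u;       // 7^5
constexpr std::uint32_t kModulus    = 2147483647u;  // 2^31 - 1

// One generator step with the product held in 64 bits, so nothing is lost.
std::uint32_t stepWide(std::uint32_t state) {
    assert(state > 0 && state < kModulus);
    const std::uint64_t product = std::uint64_t{kMultiplier} * state;
    return static_cast<std::uint32_t>(product % kModulus);
}

// The same step with a 32-bit product, which wraps modulo 2^32 first.
std::uint32_t stepWrapped32(std::uint32_t state) {
    const std::uint32_t product = kMultiplier * state;  // may wrap
    return product % kModulus;
}

// Advances `state` by `steps` wide steps.
std::uint32_t advance(std::uint32_t state, int steps) {
    for (int i = 0; i < steps; ++i) state = stepWide(state);
    return state;
}
```

Starting from a state of 1, stepWide gives 16807, 282475249, 1622650073, and the third step is where stepWrapped32 first disagrees. The commonly quoted check value for this generator is 1043618065 after 10000 steps from 1. A state of 0 stays at 0 forever, so a seed that maps to 0 is worth guarding against.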

To use it I generate a seed (per thread) using Swift in the CPU as follows:

var seeds = (0..<Constants.ThreadCount).map { _ in Float.random(in: 0..<1) }

… and load them in a buffer called seeds. Then in each Metal thread I initialise a RNG with

   RNG rng(seeds[threadIndex]);           // obtain a RNG

Finally I can get as many random numbers as I like with calls such as:

 uint ruint = rng.next(4);       // uint 0,1,2 or 3
 float rfloat = rng.next();      // float between 0.0 and 1.0
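
A possible edge case with the float conversion: a float carries only 24 bits of precision, so a state very close to the modulus rounds up to 2^31 and the ratio comes out as exactly 1.0 rather than something below it. The sketch below, in CPU-side C++, shows the effect and one hypothetical guard (toIndex is not part of the original class):

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kModulus = 2147483647u;  // 2^31 - 1

// Maps a generator state to a float the same way next() does.
float toUnitFloat(std::uint32_t state) {
    return static_cast<float>(state) / static_cast<float>(kModulus);
}

// Hypothetical guarded variant: scales to 0..<upto and clamps the top edge,
// so a ratio that rounds to 1.0 cannot produce `upto` itself.
std::uint32_t toIndex(std::uint32_t state, std::uint32_t upto) {
    assert(upto > 0);
    const auto scaled = static_cast<std::uint32_t>(
        toUnitFloat(state) * static_cast<float>(upto));
    return scaled < upto ? scaled : upto - 1;
}
```

Only a few dozen states out of about 2^31 are affected, so this would be rare, but without the clamp a call like next(4) could occasionally return 4.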

I’ve tested it carefully and I believe it works well. Nevertheless it unearthed another problem with Metal enums. I had this enum:

  typedef enum {
     north = 0,
     east,
     south,
     west
  } Cardinals;

and I wanted to say:

Cardinals  newDirection = Cardinals(rng.next(4));

It failed, although east, south and west do have the appropriate values. When I changed the enum to:

  typedef enum {
     north = 0,
     east = 1,
     south = 2,
     west = 3
  } Cardinals;

it worked! There are so many tripwires in Metal or C++ and debugging is such a pain.
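
For comparison, in standard C++ (which the Metal Shading Language is based on) the implicit and explicit forms of that enum should produce identical values, so the original failure may have had another cause, such as an index outside 0...3. A minimal sketch, with hypothetical A/B suffixes so both enums can live in one file:

```cpp
#include <cassert>

// Implicit numbering: each enumerator is the previous one plus 1.
enum CardinalsImplicit { northA = 0, eastA, southA, westA };

// Explicit numbering with the same values.
enum CardinalsExplicit { northB = 0, eastB = 1, southB = 2, westB = 3 };

// Converts a raw index to the enum, rejecting anything outside 0...3.
CardinalsImplicit cardinalFromIndex(unsigned index) {
    assert(index < 4);
    return static_cast<CardinalsImplicit>(index);
}
```

Checking the index before the cast makes an out-of-range random number show up immediately instead of as a mysterious direction later on.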

Happy New Year.

caroline · January 6, 2024, 11:27pm

Thank you for posting that!