Sending arrays to a kernel shader

tchelyzt · December 14, 2023, 7:57pm

I cannot get an array of [Int32] into my kernel shader.

Using a bridging header I can get in an array of structs like

typedef struct {
   int ar[100];
} MyArray;

kernel void myShader(device MyArray *in [[buffer(0)]],
                     device MyArray *out [[buffer(1)]],
                     uint threadIndex [[thread_position_in_grid]])

and then indexing with threadIndex I can change that array and get it back to the CPU, but …

… Swift sees, not an array but: a TUPLE!

Actually, it’s worse than that. To initialise my input I had to provide a tuple too. 100 elements!

Since I want to work with a large array, this is hopeless. I cannot figure how to build a tuple from an array and the thought of initialising a huge tuple is … well, you can imagine.

Instead, I have tried abandoning the bridging header and using a Swift struct and redefining it in the shader as

struct MyArray {
   var ar:[Int32]
}

and

typedef struct {
   int ar[100];
} MyArray;

The shader doesn’t complain but the Swift code crashes if I try to change the indices in the array. For instance

out[threadIndex] = in[threadIndex];            // copy array

works, but

int index = 99-threadIndex;
out[index] = in[threadIndex];      // reverse array

crashes.

Can anyone explain this or explain how to handle the tuple problem?

caroline · December 15, 2023, 12:48am

I would love to see a small project that illustrates this, and how you are handling it on the Swift side.

Personally, I would try and take the Int32 array out of the MyArray array, and treat it as a 2d Int32 array.

But I haven’t written code on this yet, as I would like to see what your Swift side is doing.

Presumably MyArray holds other things than just ar?

tchelyzt · December 15, 2023, 1:58am

Of course the real project is much more complex, but here is a minimal command line project which takes an array [[Int32]] and reverses both the inner and outer arrays. It prints the input followed by the output.

There are three parts:
main.swift
Shaders.metal
Bridge.h

As you can see, I needed to set up the input using buffers.

MAIN.SWIFT

// Main.swift
// if you uncomment the repeat loop in Model.compute() you can interrupt the flow
// to see the buffer contents.
 import Foundation
 import MetalKit

class Model {
   static let Device: MTLDevice = MTLCreateSystemDefaultDevice()!
   static let CommandQueue: MTLCommandQueue = Device.makeCommandQueue()!

   var mslInBuffer: MTLBuffer
   var mslOutBuffer: MTLBuffer

   var mslInPointer: UnsafeMutablePointer<MSLArray>
   var mslOutPointer: UnsafeMutablePointer<MSLArray>
  
   let bufferSize = 20
   private var computePipelineState: MTLComputePipelineState!

   init() {
      computePipelineState = Model.makeComputePipelineState()
  
      mslInBuffer = Model.Device.makeBuffer(length: MemoryLayout<MSLArray>.stride * bufferSize,      options: .storageModeShared)!
      mslOutBuffer = Model.Device.makeBuffer(length: MemoryLayout<MSLArray>.stride * bufferSize, options: .storageModeShared)!

      mslInPointer = mslInBuffer.contents().bindMemory(to: MSLArray.self, capacity: bufferSize)
      mslOutPointer = mslOutBuffer.contents().bindMemory(to: MSLArray.self, capacity: bufferSize)

      var mslIn = [MSLArray]()
      for index in 0..<bufferSize {
         var idx = Int32(index)
         let a = MSLArray(ar: (idx, idx+1, idx+2))
         mslIn.append(a)
         mslInPointer[index] = a
         idx = 0
         mslOutPointer[index] = MSLArray(ar: (idx, idx, idx))
      }
   }

   func compute() {
      print("Before:")
//      repeat {
         showResults()
         guard let commandBuffer = Model.CommandQueue.makeCommandBuffer(),
               let computeEncoder = commandBuffer.makeComputeCommandEncoder()
         else { return }
     
         computeEncoder.setComputePipelineState(computePipelineState)
     
         computeEncoder.setBuffer(mslInBuffer, offset: 0, index: 0)
         computeEncoder.setBuffer(mslOutBuffer, offset: 0, index: 1)

         let gridSize = MTLSize(width: bufferSize, height: 1, depth: 1)
     
         var threadGroupSize = computePipelineState.maxTotalThreadsPerThreadgroup
         if (threadGroupSize > bufferSize) { threadGroupSize = bufferSize }
         let threadsPerGroup = MTLSize(width: threadGroupSize, height: 1, depth: 1)
     
         computeEncoder.dispatchThreads(gridSize, threadsPerThreadgroup: threadsPerGroup)
     
         computeEncoder.endEncoding()
         commandBuffer.commit()
         commandBuffer.waitUntilCompleted()
         print("After:")
//      } while true
      showResults()
   }
}

 extension Model {
   func showResults() {
      func display<T>(name: String, ptr: UnsafeMutablePointer<T>) {
         var ar = [T]()
         var p = ptr
         for _ in 0..<bufferSize {
            ar.append(p.pointee)
            p = p.advanced(by: 1)
         }
         print("\(name): [", terminator: "")
         ar.forEach { print($0, terminator: ", ") }
         print("]\n")
      }
  
      display(name: "mslIn", ptr: mslInPointer)
      print()
  
      display(name: "mslIn reversed", ptr: mslOutPointer)
      print()
   }

   private static func makeComputePipelineState() -> MTLComputePipelineState {
      // Create Shaders
      guard let shaderLibrary = Model.Device.makeDefaultLibrary()
      else { fatalError("Failed to obtain shader library") }
      let kernelShader = shaderLibrary.makeFunction(name: "reverse_input")

      // Create Pipeline State
      do { return try Model.Device.makeComputePipelineState(function: (kernelShader!)) }
      catch { fatalError(error.localizedDescription) }
   }
}

 var model = Model()
 print("MSL Version, with tuple replacing array")
 model.compute()

SHADERS.METAL

 // Shaders.metal
 #include <metal_stdlib>
  using namespace metal;

 #include "Bridge.h"

  kernel void reverse_input(device MSLArray *mslIn [[buffer(0)]],
                            device MSLArray *mslOut [[buffer(1)]],
                            uint threadIndex [[thread_position_in_grid]])
 {
    int index = 19-threadIndex;

    for (int i=0; i<3; i++) {
       int j = 2-i;
       mslOut[index].ar[j] = mslIn[threadIndex].ar[i];
    }
 }

BRIDGE.H

 // Bridge.h
 #ifndef Bridge_h
 #define Bridge_h

  typedef struct {
  int ar[3];
 } MSLArray;

 #endif **/* Bridge_h */**

tchelyzt · December 15, 2023, 2:05am

Here’s the second version, only two files, no bridging header. Uncomment the secondlastt line in the shader and comment the last one to show where everything goes awry when we play around with the indices.

If this version worked, all our TUPLE problems would disappear.

SHADER.METAL

#include <metal_stdlib>
 using namespace metal;

 typedef struct {
   int ar[3];
} SwiftArray;

kernel void reverse_input(device SwiftArray *swiftIn [[buffer(0)]],
                          device SwiftArray *swiftOut [[buffer(1)]],
                          uint threadIndex [[thread_position_in_grid]])
{
   int index = 19-threadIndex;

   for (int i=0; i<3; i++) {
      int j = 2-i;
//      swiftOut[index].ar[j] = swiftIn[threadIndex].ar[i];
      swiftOut[threadIndex].ar[i] = swiftIn[threadIndex].ar[i];
   }
}

MAIN.SWIFT

 import Foundation
 import MetalKit

 struct SwiftArray {
   var ar:[Int32]
}

 class Model {
   static let Device: MTLDevice = MTLCreateSystemDefaultDevice()!
   static let CommandQueue: MTLCommandQueue = Device.makeCommandQueue()!

   var swiftInBuffer: MTLBuffer
   var swiftOutBuffer: MTLBuffer

   var swiftInPointer: UnsafeMutablePointer<SwiftArray>
   var swiftOutPointer: UnsafeMutablePointer<SwiftArray>
  
   let bufferSize = 20
   private var computePipelineState: MTLComputePipelineState!

   init() {
      computePipelineState = Model.makeComputePipelineState()
  
      swiftInBuffer = Model.Device.makeBuffer(length: MemoryLayout<SwiftArray>.stride * bufferSize, options: .storageModeShared)!
      swiftOutBuffer = Model.Device.makeBuffer(length: MemoryLayout<SwiftArray>.stride * bufferSize, options: .storageModeShared)!

      swiftInPointer = swiftInBuffer.contents().bindMemory(to: SwiftArray.self, capacity: bufferSize)
      swiftOutPointer = swiftOutBuffer.contents().bindMemory(to: SwiftArray.self, capacity: bufferSize)

      var swiftIn = [SwiftArray]()
      for index in 0..<bufferSize {
         var idx = Int32(index)
         let a = SwiftArray(ar: [idx, idx+1, idx+2])
         swiftIn.append(a)
         swiftInPointer[index] = a
         idx = 0
         swiftOutPointer[index] = SwiftArray(ar: [idx, idx, idx])
      }
   }

   func compute() {
      print("Before:")
      showResults()
      guard let commandBuffer = Model.CommandQueue.makeCommandBuffer(),
            let computeEncoder = commandBuffer.makeComputeCommandEncoder()
      else { return }
  
      computeEncoder.setComputePipelineState(computePipelineState)
  
      computeEncoder.setBuffer(swiftInBuffer, offset: 0, index: 0)
      computeEncoder.setBuffer(swiftOutBuffer, offset: 0, index: 1)

      let gridSize = MTLSize(width: bufferSize, height: 1, depth: 1)
  
      var threadGroupSize = computePipelineState.maxTotalThreadsPerThreadgroup
      if (threadGroupSize > bufferSize) { threadGroupSize = bufferSize }
      let threadsPerGroup = MTLSize(width: threadGroupSize, height: 1, depth: 1)
  
      computeEncoder.dispatchThreads(gridSize, threadsPerThreadgroup: threadsPerGroup)
  
      computeEncoder.endEncoding()
      commandBuffer.commit()
      commandBuffer.waitUntilCompleted()
      print("After:")
      showResults()
   }
}

 extension Model {
   func showResults() {
      func display<T>(name: String, ptr: UnsafeMutablePointer<T>) {
         var ar = [T]()
         var p = ptr
         for _ in 0..<bufferSize {
            ar.append(p.pointee)
            p = p.advanced(by: 1)
         }
         print("\(name): [", terminator: "")
         ar.forEach { print($0, terminator: ", ") }
         print("]\n")
      }
  
      display(name: "swiftIn", ptr: swiftInPointer)
      print()
  
      display(name: "swiftIn reversed", ptr: swiftOutPointer)
      print()
   }

   private static func makeComputePipelineState() -> MTLComputePipelineState {
      // Create Shaders
      guard let shaderLibrary = Model.Device.makeDefaultLibrary()
      else { fatalError("Failed to obtain shader library") }
      let kernelShader = shaderLibrary.makeFunction(name: "reverse_input")

      // Create Pipeline State
      do { return try Model.Device.makeComputePipelineState(function: (kernelShader!)) }
      catch { fatalError(error.localizedDescription) }
   }
}

 var model = Model()
 print("Swift Version, no bridging header")
 model.compute()

tchelyzt · December 15, 2023, 2:08am

I have repeatedly tried to do this but I end up with a shader parameter which requires an address space qualification and either cannot attach a buffer or knows nothing about indexes. Meaningless to me I’m afraid. Try it.

caroline · December 16, 2023, 4:42am

You are correct that Swift sees this as a tuple:

struct MyArray {
   var ar:[Int32]
}

Swift doesn’t support fixed length arrays, so you can’t have it in the bridging header like that.

In your second example, the GPU isn’t seeing the integer array correctly. If you do a programmatic capture, you can show the buffer. It’s seeing garbage. I’m not sure about the memory model, but I don’t think this is the best way of handling it anyway.

This is my attempt:

IntArray.zip (47.9 KB)

When you run it, you’ll find yourself in the GPU debugger, as it does a capture. You can then debug the code for each thread and look at the buffers.

I am assuming that SwiftArray has other elements aside from ar, so I have created some float values in that array and I have a separate 2d array for ar.

In Swift, you just use the same index for each. On the GPU, for the BridgedArray, you use threadIndex, and for the ints, you use threadIndex * number of items in ar.

This will only work if the count of the array elements for ar is the same for each BridgedArray. The GPU isn’t very good at dynamic counts. You’d have to do it a different way if this is not the case.

P.S. you could probably do this in 2d with thread size width and height. You could see if it’s faster with a 2d texture (putting the data in a texture). Also, I haven’t thought about it much, but argument buffers might help too (but possibly not).

tchelyzt · December 16, 2023, 1:27pm

Thanks Caroline,

I see the idea for the solution using your approach if I put a variable

int ar[innerCount];

inside my shader. I’ll never need a tuple in the Swift code.

I’ll get back to you.

tchelyzt · December 17, 2023, 2:08am

Thanks Caroline,
it was so simple and so obvious but I didn’t see it without your help. The [[Int32]] version doesn’t even need a bridge header. Here’s a working version using [[MyEnum]] (which is what I want to process in my actual project. It outputs a copy, a reversal of inner array, a reversed outer array and reversals of inner within outer and is very easy to implement. (Please excuse the silly app name).

Dummy3.zip (25.0 KB)

You mentioned the following:

It seems to me that textures limit the data size to a maximum of 32 bits, unless I’ve misunderstood you.

caroline · December 18, 2023, 12:05am

I’m glad you sorted it.

I didn’t try it with textures, just that in one (some?) of the Apple videos they have suggested that using textures is sometimes more performant than using buffers. Textures are by nature 2D (or 3D), whereas buffers are linear 1D.

Sounds like the number of items you have is too many though, and it would need testing to find out which is quicker. You’d obviously have the overhead of initialising the texture too.

https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf defines the maximum 2D texture width and height to 16,384 pixels. You can define the texture format as integer or float.

tchelyzt · February 18, 2024, 6:58pm

Removed. Think I was talking rubbish.