Kernel Writes Incorrect Value to Device Memory

Hello -

I’ve encountered a rather weird issue and was wondering if anyone else has seen something like it. I am working on a kernel that performs a radix sort. The kernel follows published algorithms: it buckets an unsigned integer by two bits at a time, using a prefix sum to compute each bucket's offset.
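For anyone following along, here is a rough CPU-side sketch of that kind of sort, useful as a reference to compare GPU output against. It assumes a least-significant-digit radix sort that processes two bits per pass; the function names `radix_pass` and `radix_sort_u32` are made up for illustration and are not taken from the kernel in question.

```python
def radix_pass(keys, shift, bits=2):
    """One stable counting-sort pass over `bits` bits starting at bit `shift`."""
    radix = 1 << bits          # 4 buckets when bits == 2
    mask = radix - 1

    # Histogram: how many keys fall into each bucket.
    counts = [0] * radix
    for k in keys:
        counts[(k >> shift) & mask] += 1

    # Exclusive prefix sum turns counts into starting offsets.
    offsets = [0] * radix
    running = 0
    for d in range(radix):
        offsets[d] = running
        running += counts[d]

    # Scatter keys to their destination, preserving input order (stable).
    out = [0] * len(keys)
    for k in keys:
        d = (k >> shift) & mask
        out[offsets[d]] = k
        offsets[d] += 1
    return out


def radix_sort_u32(keys, bits=2):
    """Sort 32-bit unsigned integers by repeating the pass from low bits to high."""
    for shift in range(0, 32, bits):
        keys = radix_pass(keys, shift, bits)
    return keys
```

Running the same input through this and through the kernel, one pass at a time, can show which pass first produces a different result.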

I am testing out the kernel using a single threadgroup with 256 threads. After executing the kernel, I read back the values on the CPU. If everything worked as expected, I’d have an ordered list of unsigned ints.

But at thread 53 the kernel starts writing garbage values to the device memory out buffer. Here’s the weird part: In the GPU frame capture, the calculations are correct, and I can see the correct value in the debugger. If the information in the GPU frame capture were written out to the device memory buffer, I’d have a correctly sorted list. If I write a primitive value to the buffer (like the thread id), the CPU reads back the correct number.
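One way to narrow this down on the CPU side is to locate the first index where the readback differs from the expected result and to check whether the readback is at least a permutation of the input. This is only a sketch; `first_mismatch` and `diagnose` are hypothetical helper names, and `readback` stands for whatever list of integers was copied out of the output buffer.

```python
from collections import Counter


def first_mismatch(actual, expected):
    """Return the first index where the two sequences differ, or None if equal."""
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i
    if len(actual) != len(expected):
        return min(len(actual), len(expected))
    return None


def diagnose(readback, source):
    """Compare GPU readback against a CPU sort of the same source data."""
    expected = sorted(source)
    return {
        # Index of the first wrong element (None if the output is correct).
        "first_bad_index": first_mismatch(readback, expected),
        # True if every input value appears in the output the right number
        # of times; False suggests values were lost or overwritten.
        "is_permutation": Counter(readback) == Counter(source),
    }
```

If the readback is a permutation of the input but in the wrong order, the problem is more likely in the computed offsets; if it is not a permutation, writes are probably being lost, overwritten, or landing out of bounds.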

I have tried adding memory barriers to fix the issue, to no avail. Any ideas on what this symptom might mean?

Sorry, no ideas except making sure that the out buffer is big enough to store all the values.
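As a quick sanity check on the size, assuming the elements are 32-bit unsigned integers (the post only says "unsigned int", so the 4-byte width is an assumption), the arithmetic looks like this. The helper names are hypothetical.

```python
import struct

# Size in bytes of one 32-bit unsigned integer (assumed element type).
U32_SIZE = struct.calcsize("<I")


def required_bytes(element_count):
    """Minimum buffer length needed to hold `element_count` u32 values."""
    return element_count * U32_SIZE


def byte_offset(index):
    """Byte offset at which element `index` starts."""
    return index * U32_SIZE


def buffer_is_large_enough(allocated_bytes, element_count):
    """True if the allocation can hold every element."""
    return allocated_bytes >= required_bytes(element_count)
```

With 256 threads each writing one value, the output buffer would need at least `required_bytes(256)` bytes, and element 53 would start at `byte_offset(53)`.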

Have you tried

   CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR

since your GPU shares the same memory as the CPU?

On an iGPU, device memory sits in the same physical memory as the host's.

Create several buffers and stress test them. If all of them come back with invalid values, install a different driver version, probably a newer one. If that doesn't solve it, RMA your card.

If only a single buffer is erroneous, then it is likely a simple VRAM error. Tag that buffer as unusable, create new buffers as necessary, and avoid the bad one, though I'm not sure whether the driver swaps buffers in the background. If every single kernel is malfunctioning, then the cores may be damaged too.