#FluidX3D #CFD v3.2 is out! I've implemented the much requested #GPU summation for object force/torque; it's ~20x faster than #CPU #multithreading.
Horizontal sum in #OpenCL was a nice exercise - first a local-memory reduction, then a hardware-supported atomic floating-point add in VRAM, all in a single-stage kernel. Hammering atomics isn't too bad, as each of the ~10-340 workgroups dispatched at a time does only a single atomic add.
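For the curious, the pattern boils down to something like this sketch (not the actual FluidX3D kernel; it assumes the cl_ext_float_atomics fp32 global atomic add and a power-of-two workgroup size passed in as a hypothetical -D WG=... define):

```c
// Sketch of a single-stage horizontal sum: local memory tree reduction,
// then one FP32 atomic add to VRAM per workgroup.
#pragma OPENCL EXTENSION cl_ext_float_atomics : enable // assumes hardware fp32 global atomic add

kernel void sum_fx(const global float* Fx, const uint N, volatile global atomic_float* sum) {
	const uint n = get_global_id(0), lid = get_local_id(0);
	local float cache[WG];                 // WG: power-of-two workgroup size (hypothetical compile-time define)
	cache[lid] = n<N ? Fx[n] : 0.0f;       // one value per thread, pad the tail with 0
	barrier(CLK_LOCAL_MEM_FENCE);
	for(uint s=WG>>1u; s>0u; s>>=1u) {     // tree reduction in local memory
		if(lid<s) cache[lid] += cache[lid+s];
		barrier(CLK_LOCAL_MEM_FENCE);
	}
	if(lid==0u) atomic_fetch_add_explicit(sum, cache[0], memory_order_relaxed); // single atomic add per workgroup
}
```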
Also improved volumetric #raytracing!
https://github.com/ProjectPhysX/FluidX3D/releases/tag/v3.2
Hot Aisle's 8x AMD #MI300X server is the fastest computer I've ever tested in #FluidX3D #CFD, achieving a peak #LBM performance of 205 GLUPs/s, and a combined VRAM bandwidth of 23 TB/s.
The #RTX 5090 looks like a toy in comparison.
MI300X beats even Nvidia's GH200 94GB. This marks a very fascinating inflection point in #GPGPU: #CUDA is not the performance leader anymore.
You need a cross-vendor language like #OpenCL to leverage its power.
FluidX3D on #GitHub: https://github.com/ProjectPhysX/FluidX3D
pyATF: Constraint-Based Auto-Tuning in Python
#OpenCL #CUDA #Performance #AutoTuning #Compilers #Python #Package
I added hardware dp4a support in #OpenCL also for Intel/AMD #GPUs.
6 (!) spec/driver bugs needed workarounds:
- CL_DEVICE_OPENCL_C_VERSION unreliable: reports 1.2 if 3.0 is supported but not 2.X
- CL_DEVICE_OPENCL_C_ALL_VERSIONS broken for AMD
- CL_DEVICE_INTEGER_DOT_PRODUCT.. does UB on old Intel drivers
- dp4a feature macro only supported for -cl-std=CL3.0, and always falsely set on old Intel driver
- dot_acc_sat(a,b,c) on Intel gets translated to slow add_sat(dot(a,b),c), must be c+dot(a,b)
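Roughly, the guarded helper ends up looking like this sketch (not the exact code from the release; it assumes -cl-std=CL3.0 so the feature macro is actually defined, per the bullets above):

```c
// Sketch: signed 8-bit dot product with 32-bit accumulate (dp4a), operands packed into ints.
int dp4a_i8(const int a, const int b, const int c) {
#if defined(__opencl_c_integer_dot_product_input_4x8bit)
	return c+dot(as_char4(a), as_char4(b)); // hardware path, written as c+dot(a,b) to dodge the slow dot_acc_sat translation on Intel
#else
	const char4 x=as_char4(a), y=as_char4(b); // emulation fallback; some compilers pattern-match this into dp4a
	return c + (int)x.s0*(int)y.s0 + (int)x.s1*(int)y.s1 + (int)x.s2*(int)y.s2 + (int)x.s3*(int)y.s3;
#endif
}
```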
My OpenCL-Benchmark now uses the dp4a instruction on supported hardware (#Nvidia Pascal, #Intel #Arc, #AMD RDNA, or newer) to benchmark INT8 throughput.
dp4a is not exposed in #OpenCL C, but can still be used via inline PTX assembly and compiler pattern recognition. Even Nvidia's compiler will turn the emulation implementation into dp4a, but in some cases it does so with a bunch of unnecessary shifts/permutations on the inputs, so it's better to use inline PTX directly.
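Something along these lines (a sketch; the asm() register-constraint syntax is assumed to match CUDA-style inline PTX as accepted by Nvidia's OpenCL compiler):

```c
// Sketch: dp4a on Nvidia via inline PTX (4x signed 8-bit dot product with 32-bit accumulate).
int dp4a_ptx(const int a, const int b, const int c) {
	int d;
	asm volatile("dp4a.s32.s32 %0, %1, %2, %3;" : "=r"(d) : "r"(a), "r"(b), "r"(c));
	return d;
}
```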
https://github.com/ProjectPhysX/OpenCL-Benchmark/releases/tag/v1.8
I'm getting the material ready for my upcoming #GPGPU course that starts in March. Even though I most probably won't get to it, I also checked my trivial #SYCL programs. Apparently the 2025.0 version of the #Intel #OneAPI #DPCPP runtime doesn't like any #OpenCL platform except Intel's own (I have two other platforms that support #SPIRV, so why aren't they showing up? From the documentation I can find online this should be sufficient, but apparently it's not …)
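For what it's worth, this is the kind of plain-OpenCL diagnostic I'd use to check which platforms actually report SPIR-V ingestion (just a sketch; it only lists CL_DEVICE_IL_VERSION and says nothing about how the DPC++ runtime filters platforms):

```c
// Diagnostic: list OpenCL platforms/devices and the IL versions they can ingest (e.g. "SPIR-V_1.2").
#include <stdio.h>
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

int main(void) {
	cl_platform_id platforms[16]; cl_uint np = 0;
	clGetPlatformIDs(16, platforms, &np);
	for(cl_uint p=0; p<np; p++) {
		char pname[256] = "";
		clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(pname), pname, NULL);
		cl_device_id devices[16]; cl_uint nd = 0;
		clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &nd);
		for(cl_uint d=0; d<nd; d++) {
			char dname[256] = "", il[256] = "";
			clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(dname), dname, NULL);
			clGetDeviceInfo(devices[d], CL_DEVICE_IL_VERSION, sizeof(il), il, NULL); // stays empty if the device ingests no IL
			printf("%s | %s | IL: %s\n", pname, dname, il);
		}
	}
	return 0;
}
```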
Arm Mali #Panfrost Driver Lands #OpenCL C Support In Mesa 25.1
https://www.phoronix.com/news/Panfrost-Lands-OpenCL-C #OpenSource
Other things I have tested with FreeBSD: OpenCL with a discrete GPU via the PyOpenCL library
#FluidX3D #CFD v3.1 is out! I have updated the #OpenCL headers for better device specs detection via device ID and #Nvidia compute capability, fixed broken voxelization on some #GPUs, and added a workaround for a CPU compiler bug that corrupted rendering. Also, #AMD GPUs will now show up with their correct name (no idea why AMD can't report it via CL_DEVICE_NAME like every other sane vendor and instead needs the CL_DEVICE_BOARD_NAME_AMD extension...)
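For reference, the workaround amounts to something like this sketch (not FluidX3D's exact code; the 0x4038 token is taken from the cl_amd_device_attribute_query extension in cl_ext.h, so verify it against your headers):

```c
// Sketch: get a human-readable GPU name; on AMD, CL_DEVICE_NAME often returns only the gfx codename,
// the marketing name hides behind the cl_amd_device_attribute_query extension.
#include <string.h>
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#ifndef CL_DEVICE_BOARD_NAME_AMD
#define CL_DEVICE_BOARD_NAME_AMD 0x4038 // from cl_amd_device_attribute_query (see cl_ext.h)
#endif

void get_device_name(cl_device_id device, char* name, size_t size) {
	char extensions[4096] = "";
	clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);
	if(strstr(extensions, "cl_amd_device_attribute_query") != NULL) {
		clGetDeviceInfo(device, CL_DEVICE_BOARD_NAME_AMD, size, name, NULL); // e.g. "AMD Radeon RX 7900 XTX"
	} else {
		clGetDeviceInfo(device, CL_DEVICE_NAME, size, name, NULL); // every other vendor reports the name here
	}
}
```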
Have fun!
https://github.com/ProjectPhysX/FluidX3D/releases/tag/v3.1
Uuhhh my university has the new #HPC cluster online, with a couple of dual Intel Xeon Platinum 8480+ nodes with 2TB RAM and quad #GPUs each. Guess who still has access!
#OpenCL specs: https://opencl.gpuinfo.org/displayreport.php?id=4731
Cluster website: https://www.hpc.uni-bayreuth.de/clusters/festus/#compute-nodes