techhub.social is one of the many independent Mastodon servers you can use to participate in the fediverse.
A hub primarily for passionate technologists, but everyone is welcome

Server stats: 5.4K active users

#opencl

0 posts · 0 participants · 0 posts today
Dr. Moritz Lehmann

I made this #FluidX3D #CFD simulation run on a frankenstein zoo of 🟥AMD + 🟩Nvidia + 🟦Intel #GPUs! 🖖🤪
https://www.youtube.com/watch?v=_8Ed8ET9gBU

The ultimate SLI abomination setup:
- 1x Nvidia A100 40GB
- 1x Nvidia Tesla P100 16GB
- 2x Nvidia A2 15GB
- 3x AMD Instinct MI50
- 1x Intel Arc A770 16GB

I split the 2.5B cells into 9 domains of 15GB each - the A100 takes 2 domains, the other GPUs 1 domain each. The GPUs communicate over PCIe via #OpenCL.

Huge thanks to Tobias Ribizel from TUM for the hardware!
Giuseppe Bilotta

I'm liking the class this year. Students are attentive and participating, and the discussion is always productive.

We were discussing the rounding up of the launch grid in #OpenCL to avoid the catastrophic performance drops that come from the inability to divide the “actual” work size by anything smaller than the maximum device local work size, and how to compute the “rounded up” work size.

The idea is this: given the work size N and the local size L, we have to round N up to the smallest multiple of L that is not smaller than N. This effectively means computing D = ceil(N/L) and then using D*L.

There are several ways to compute D, but on the computer, working only with integers and knowing that integer division always rounds down, what is the “best way”?

D = N/L + 1 works well if N is not a multiple of L, but gives us 1 more than the intended result if N *is* a multiple of L. So we want to add the extra 1 only if N is not a multiple. This can be achieved for example with

D = N/L + !!(N % L)

which leverages the fact that !! (double logical negation) turns any non-zero value into 1, leaving zero as zero. So we round *down* (which is what the integer division does) and then add 1 if (and only if) there is a remainder to the division.

This is ugly not so much because of the !!, but because the modulus operation % is slow.

1/n
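A minimal C sketch of the computation discussed in the post above (ceil_div is a hypothetical helper name, not from the post):

#include <stdio.h>

// round down with integer division, then add 1 iff there is a remainder
unsigned ceil_div(const unsigned N, const unsigned L) {
    return N / L + !!(N % L);
}

int main(void) {
    const unsigned L = 256u;
    printf("%u\n", ceil_div(1000u, L) * L); // 1024: rounded up to a multiple of L
    printf("%u\n", ceil_div(1024u, L) * L); // 1024: N is already a multiple of L
    return 0;
}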
mirror::box::milo

#generativeart #artificiallife #reactiondiffusion #blender #opencl #creativecoding #emergence #isosurface
HGPU group

LLMPerf: GPU Performance Modeling meets Large Language Models

#OpenCL #LLM #Performance #Package

https://hgpu.org/?p=29826
Dr. Moritz Lehmann

I got access to @LRZ_DE's new coma-cluster for #OpenCL benchmarking and experimentation 🖖😋💻🥨🍻
I've added a ton of new #FluidX3D #CFD #GPU/#CPU benchmarks:
https://github.com/ProjectPhysX/FluidX3D?tab=readme-ov-file#single-gpucpu-benchmarks

Notable hardware configurations include:
- 4x H100 NVL 94GB
- 2x Nvidia L40S 48GB
- 2x Nvidia A2 15GB datacenter toaster
- 2x Intel Arc A770 16GB
- AMD+Nvidia SLI abomination consisting of 3x Instinct MI50 32GB + 1x A100 40GB
- AMD Radeon 8060S (chonky Ryzen AI Max+ 395 iGPU with quad-channel RAM) thanks to @cheese
Dave Tabb

I had hoped that DIA-NN 2.02 could be accelerated by either ATI or nVidia GPUs, but I have tried with both types of cards to no avail.

#bioinformatic #proteomics #OpenCL #CUDA
.:\dGh/:.

Is there any difference between computing AI workloads in Vulkan, OpenCL, and CUDA?

I know that some people say NVIDIA doesn't support OpenCL or Vulkan quite well, and that performance is achieved by using CUDA. But what is the story for the other vendors (Intel, AMD, Qualcomm, Apple)?

#AI #Programming #AIProgramming #AIDevelopment #Software #SoftwareDevelopment #Vulkan #OpenCL #CUDA #NVIDIA #Intel #IntelArc #AMD #AMDRadeon #Radeon #Qualcomm #Apple #AppleSilicon #AppleM4 #M4
GPUOpen

🧐 AMD Radeon GPU Analyzer (RGA) is our performance analysis tool for #DirectX, #Vulkan, SPIR-V, #OpenGL, & #OpenCL.

✨ As well as updates for AMD RDNA 4, there are enhancements to the ISA view UI, using the same updated UI as RGP ✨

More detail: https://gpuopen.com/learn/rdna-cdna-architecture-disassembly-radeon-gpu-analyzer-2-12/?utm_source=mastodon&utm_medium=social&utm_campaign=rdts
(🧵5/7)
Dr. Moritz Lehmann

Here's my #OpenCL implementation: https://github.com/ProjectPhysX/FluidX3D/blob/master/src/kernel.cpp#L1924-L1993

#FluidX3D #CFD v3.2 is out! I've implemented the much requested #GPU summation for object force/torque; it's ~20x faster than #CPU #multithreading. 🖖😋
Horizontal sum in #OpenCL was a nice exercise - first a local-memory reduction, then a hardware-supported atomic floating-point add in VRAM, all in a single-stage kernel. Hammering atomics isn't too bad, as each of the ~10-340 workgroups dispatched at a time does only a single atomic add.
Also improved volumetric #raytracing!
github.com/ProjectPhysX/FluidX
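A minimal single-stage OpenCL C sketch of that pattern (illustrative, not FluidX3D's actual kernel; assumes the cl_ext_float_atomics extension for the hardware floating-point atomic add, and the kernel/parameter names are hypothetical):

#pragma OPENCL EXTENSION cl_ext_float_atomics : enable // assumed supported

kernel void sum_reduce(const global float* x, volatile global float* sum, local float* cache, const uint n) {
    const uint gid = get_global_id(0), lid = get_local_id(0);
    cache[lid] = gid < n ? x[gid] : 0.0f; // stage values in local memory
    barrier(CLK_LOCAL_MEM_FENCE);
    for (uint s = get_local_size(0) / 2u; s > 0u; s >>= 1u) { // local tree reduction
        if (lid < s) cache[lid] += cache[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0u) // only 1 atomic add to VRAM per workgroup
        atomic_fetch_add_explicit((volatile global atomic_float*)sum, cache[0], memory_order_relaxed, memory_scope_device);
}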

Hot Aisle's 8x AMD #MI300X server is the fastest computer I've ever tested in #FluidX3D #CFD, achieving a peak #LBM performance of 205 GLUPs/s, and a combined VRAM bandwidth of 23 TB/s. 🖖🤯
The #RTX 5090 looks like a toy in comparison.

MI300X beats even Nvidia's GH200 94GB. This marks a fascinating inflection point in #GPGPU: #CUDA is not the performance leader anymore. 🖖😛
You need a cross-vendor language like #OpenCL to leverage its power.

FluidX3D on #GitHub: github.com/ProjectPhysX/FluidX

I added hardware dp4a support in #OpenCL also for Intel/AMD #GPUs.
6 (!) spec/driver bugs needed workarounds:
- CL_DEVICE_OPENCL_C_VERSION unreliable: reports 1.2 if 3.0 is supported but not 2.X
- CL_DEVICE_OPENCL_C_ALL_VERSIONS broken for AMD
- CL_DEVICE_INTEGER_DOT_PRODUCT.. does UB on old Intel drivers
- dp4a feature macro only supported with -cl-std=CL3.0, and always falsely set on old Intel drivers
- dot_acc_sat(a,b,c) on Intel gets translated to slow add_sat(dot(a,b),c), must be c+dot(a,b) (see the sketch below)
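A hedged OpenCL C sketch of how the last two workarounds in the list above combine (dot() and dot_acc_sat() come from the cl_khr_integer_dot_product extension; the gating and the helper name dp4a_i8 are illustrative assumptions):

#if defined(__opencl_c_integer_dot_product_input_4x8bit) // only defined with -cl-std=CL3.0
inline int dp4a_i8(const char4 a, const char4 b, const int c) {
    return c + dot(a, b); // not dot_acc_sat(a, b, c), which Intel lowers to slow add_sat(dot(a, b), c)
}
#endif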

My OpenCL-Benchmark now uses the dp4a instruction on supported hardware (#Nvidia Pascal, #Intel #Arc, #AMD RDNA, or newer) to benchmark INT8 throughput.
dp4a is not exposed in #OpenCL C, but can still be used via inline PTX assembly and compiler pattern recognition. Even Nvidia's compiler will turn the emulation implementation into dp4a, but in some cases it does so with a bunch of unnecessary shifts/permutations on the inputs, so it's better to use inline PTX directly. 🖖🧐
github.com/ProjectPhysX/OpenCL
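A hedged sketch of both paths described above (the __NV_CL_C_VERSION gate and the identifiers are assumptions for illustration, not copied from OpenCL-Benchmark):

inline int dp4a(const int a, const int b, const int c) {
#ifdef __NV_CL_C_VERSION // Nvidia OpenCL compiler: use inline PTX assembly directly
    int d;
    asm("dp4a.s32.s32 %0, %1, %2, %3;" : "=r"(d) : "r"(a), "r"(b), "r"(c));
    return d;
#else // emulation that pattern-recognizing compilers can turn into a single hardware dp4a
    const char4 a4 = as_char4(a), b4 = as_char4(b);
    return c + a4.x * b4.x + a4.y * b4.y + a4.z * b4.z + a4.w * b4.w;
#endif
}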

I'm getting the material ready for my upcoming #GPGPU course that starts in March. Even though I most probably won't get to it, I also checked my trivial #SYCL programs. Apparently the 2025.0 version of the #Intel #OneAPI #DPCPP runtime doesn't like any #OpenCL platform except Intel's own (I have two other platforms that support #SPIRV, so why aren't they showing up? From the documentation I can find online this should be sufficient, but apparently it's not …)

#FluidX3D #CFD v3.1 is out! I have updated the #OpenCL headers for better device specs detection via device ID and #Nvidia compute capability, fixed broken voxelization on some #GPUs, and added a workaround for a CPU compiler bug that corrupted rendering. Also, #AMD GPUs will now show up with their correct name (no idea why AMD can't report it via CL_DEVICE_NAME like every other sane vendor and instead needs the CL_DEVICE_BOARD_NAME_AMD extension...)
Have fun! 🖖😉
github.com/ProjectPhysX/FluidX
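A hedged host-side C sketch of that device-name workaround (CL_DEVICE_BOARD_NAME_AMD comes from the cl_amd_device_attribute_query extension; the helper name print_device_name is illustrative):

#include <CL/cl.h>
#include <stdio.h>

#ifndef CL_DEVICE_BOARD_NAME_AMD
#define CL_DEVICE_BOARD_NAME_AMD 0x4038 // from cl_amd_device_attribute_query
#endif

void print_device_name(const cl_device_id device) {
    char name[256] = "";
    // try the AMD board name first; the query fails cleanly on non-AMD devices
    if (clGetDeviceInfo(device, CL_DEVICE_BOARD_NAME_AMD, sizeof(name), name, NULL) != CL_SUCCESS)
        clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("%s\n", name);
}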

Thank you for using FluidX3D! Update v3.1 brings two critical bug fixes/workarounds and various small improvements under the hood:

Improvements:
- faster enqueueReadBuffer() on modern CPUs with 64-B...