Packaging rest of the ROCm stack
Open, Normal, Public

Description

Now that T6614 is mostly complete, Solus should soon have OpenCL support on AMD as well as the base HIP package. But this is not enough for most programs that use ROCm for AMD GPU acceleration to work. Here is an incomplete list of packages that could make use of ROCm, along with their necessary dependencies:

Here is my repository where I collect all my work on the ROCm stack, including WIP packages.
All help with packaging and testing is welcome, as are ideas for useful packages.
Also, all packages that depend on HIP seem to have spotty GPU support and take a long time to compile, because a separate kernel is generated for each supported GPU.
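To make the build-time point concrete, here is a minimal sketch of how the number of kernel builds grows with the list of GPU targets, and how a restricted target list could be passed to CMake. It assumes the package honours the AMDGPU_TARGETS CMake variable, which several ROCm libraries do but which should be checked per package. The function names and the kernel count are hypothetical and only for illustration.

```python
def amdgpu_targets_flag(targets):
    """Build a CMake argument restricting the GPU architectures to compile for.

    Assumption: the package reads AMDGPU_TARGETS as a semicolon-separated list.
    """
    unique = sorted(set(targets))  # drop duplicates, keep a stable order
    if not unique:
        raise ValueError("at least one GPU target is required")
    return "-DAMDGPU_TARGETS=" + ";".join(unique)


def kernel_builds(num_kernels, targets):
    """Rough count of device compilations: one per kernel per distinct target."""
    return num_kernels * len(set(targets))


if __name__ == "__main__":
    wanted = ["gfx900", "gfx1030", "gfx1030"]  # example target list, not a recommendation
    print(amdgpu_targets_flag(wanted))
    print(kernel_builds(100, wanted))  # 100 is a made-up kernel count
```

Trimming the target list is the main lever here: halving the number of targets roughly halves the device compilation work, at the cost of dropping support for the removed GPUs.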

Event Timeline

joebonrichie edited projects, added Software; removed Lacks Project.
joebonrichie moved this task from Backlog to Improvement on the Software board.

The main holdback is LLVM 15. For the past month or so, I've attempted both ROCm 5.1 and 5.2, and here are the main takeaways:

  • ROCm 5.1
    • rocfft can't build due to cryptic LLVM machine code errors, which basically prevents building any ML software for ROCm.
    • Can't build Blender because it requires the __noinline__ attribute to be defined, which is only available in LLVM 15.
  • ROCm 5.2
    • All packages can be built successfully with the current LLVM 14, but this requires 6 patches (including the __noinline__ one above).
    • Even with the patched LLVM, Blender still fails to build due to some other cryptic LLVM machine code errors.
    • PyTorch fails to detect the correct device machine code ("hipErrorNoBinaryForGpu: Unable to find code object for all current devices!") no matter what I do. Setting HSA_OVERRIDE_GFX_VERSION together with AMD_LOG_LEVEL debugging shows that it jumps between machine code detections (e.g. it says the file is for gfx1032 when I'm emulating gfx1030, and says the file is for gfx1030 when I'm on my own gfx1032 device).
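For anyone who wants to reproduce the PyTorch debugging setup, here is a rough sketch of how the environment could be prepared. It assumes the override variable is spelled HSA_OVERRIDE_GFX_VERSION and takes a major.minor.stepping value (gfx1030 becomes 10.3.0), and that a higher AMD_LOG_LEVEL means more verbose output. The helper names are hypothetical.

```python
import os


def gfx_to_override(gfx_target):
    """Convert a target name such as 'gfx1030' to an override value such as '10.3.0'.

    Assumption: the last two characters are minor and stepping (hex digits),
    and everything between 'gfx' and those is the decimal major version.
    """
    name = gfx_target.lower()
    if not name.startswith("gfx") or len(name) < 6:
        raise ValueError(f"unexpected target name: {gfx_target!r}")
    digits = name[3:]
    major = int(digits[:-2])
    minor = int(digits[-2], 16)
    stepping = int(digits[-1], 16)
    return f"{major}.{minor}.{stepping}"


def debug_env(emulated_target, log_level=4, base=None):
    """Return a copy of the environment with the override and logging set."""
    env = dict(os.environ if base is None else base)
    env["HSA_OVERRIDE_GFX_VERSION"] = gfx_to_override(emulated_target)
    env["AMD_LOG_LEVEL"] = str(log_level)  # assumed: larger value, more output
    return env


if __name__ == "__main__":
    env = debug_env("gfx1030")
    print(env["HSA_OVERRIDE_GFX_VERSION"], env["AMD_LOG_LEVEL"])
```

The returned dictionary can be passed as env= to subprocess.run when launching a small PyTorch check, so the override applies only to that one process and not to the whole shell session.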

Full list of patches required to build ROCm with our LLVM:

  1. Code object v5: afc9d674fe5a14b95c50a38d8605a159c2460427
  2. Link code objects correctly: 092f15ac40ce35d077e0225a4462bc4dfa379391
  3. __noinline__: d4e4ef2e81e03246e29e9b6eaa2929ebd4e77784

(the rest are required to apply the __noinline__ patch)

  1. 6655c5a6bb13a7db483d1eea6e1071972b13a62d
  2. 223b8240223541d3feb0c96b7f9bac114cd72f46
  3. 56e7d6bd444cef8d879adc35dcf461cb4d2ed6d5

Repo for 5.2.3 here if anyone wants to take a stab.

Feel free to package the ROCm-specific bundled LLVM for the time being if you want. I think it'll be at least until LLVM 17 before ROCm and upstream LLVM are better aligned.

@GZGavinZhao - is this task deprecated or still relevant after the recent work on the ROCm stack?

ermo added a subscriber: ermo.

Re-assigning this to Jacek as this will be one of his responsibilities when he joins.

There are 3 things to be worked on with ROCm:

  • Make Blender build with our ROCm stack. No additional dependencies are needed. This looks like an issue with Blender's build system, but it might need some tweaks to -HIP.
  • Package all dependencies necessary for the likes of PyTorch or TensorFlow. Even if we do all the work, we will still need to decide whether we want to enable it in our repo, as building kernels for all GPU architectures will take a lot of time.
  • When a new LLVM eventually lands, we will have to update to ROCm 5.4.x.

Gavin has already done a lot of work on packaging TensorFlow, and we both spent some time trying to make Blender build. Not sure who will take those tasks on, or when.