Trying OpenCL on Guix: An Experience Report
Recently, I wanted to run Leela Zero with q5go to help me get better at playing Go. I run GNU Guix on my machine, so I installed the packages with guix package -i leela-zero gnugo q5go. This got q5go going, and I was able to point it at Leela Zero (~/.guix-profile/bin/leelaz) as an analysis engine; however, it did not work. Here is my experience determining why.
Like many machine learning programs, Leela Zero can optionally use your GPU via OpenCL to drastically increase its operations per second. Unfortunately, invoking leelaz with leelaz --gtp wasn't working for me. If I invoked Leela Zero with leelaz --cpu-only, it worked, but took a long time to analyze moves. This suggested to me that the issue was with my OpenCL setup. We can use the clinfo program to troubleshoot this.
guix environment --ad-hoc clinfo -- clinfo
Number of platforms 0
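When repeating this check across several setups, it can help to pull just the count out of the clinfo output. The following is a minimal sketch and not part of the original investigation: platform_count is a hypothetical helper name, and it assumes the output contains a line beginning with "Number of platforms", as shown above.

```shell
# Hypothetical helper (not part of clinfo or Guix): read clinfo output on
# stdin and print the number of OpenCL platforms it reports.
# Assumes a line starting with "Number of platforms"; prints 0 if none is found.
platform_count() {
    awk '/^Number of platforms/ { print $NF; found = 1; exit }
         END { if (!found) print 0 }'
}

# Possible usage, assuming clinfo is available in the environment:
#   guix environment --ad-hoc clinfo -- clinfo | platform_count
```

A result of 0 means the ICD loader found no vendor drivers at all, which is the situation described here.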
This strongly suggests there is something wrong with Guix's OpenCL setup. We can use strace to learn more.
guix environment --ad-hoc strace clinfo -- strace -o/dev/stdout -eopenat clinfo
openat(AT_FDCWD, "/gnu/store/2ax9z25142khhqx61ks767jr758pzq5r-clinfo-3.0.21.02.21/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/gnu/store/i70jq190cpc45crbnrw8g8lgb4djyi9r-opencl-icd-loader-2021.06.30/lib/libOpenCL.so.1", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/gnu/store/5h2w4qi9hk1qzzgi1w83220ydslinr4s-glibc-2.33/lib/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/gnu/store/094bbaq6glba86h1d4cj16xhdi6fk2jl-gcc-10.3.0-lib/lib/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/gnu/store/5h2w4qi9hk1qzzgi1w83220ydslinr4s-glibc-2.33/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/gnu/store/5h2w4qi9hk1qzzgi1w83220ydslinr4s-glibc-2.33/lib/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/etc/OpenCL/vendors", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
Number of platforms 0
+++ exited with 0 +++
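The one line that matters in the trace above is the failed lookup. As a rough sketch, a small filter can reduce strace output to just the paths that could not be opened. The name failed_opens is hypothetical, and the pattern assumes the openat(...) = -1 ENOENT line format shown in this trace, with one call per line.

```shell
# Hypothetical helper: read strace output on stdin and print only the paths
# whose openat() call failed with ENOENT.
# Assumes one call per line in the form:
#   openat(AT_FDCWD, "/some/path", FLAGS) = -1 ENOENT (No such file or directory)
failed_opens() {
    sed -n 's/^openat([^,]*, "\([^"]*\)".*= -1 ENOENT.*$/\1/p'
}

# Possible usage, assuming strace and clinfo are available:
#   strace -o/dev/stdout -eopenat clinfo | failed_opens
```

Applied to the trace above, this filter would leave only /etc/OpenCL/vendors, the directory the ICD loader expected to find.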
Despite mesa-opencl being installed in the system profile, Guix had not populated /etc/OpenCL from the package. The directory is, however, present in the profile:
find -L /run/current-system/profile -name OpenCL
/run/current-system/profile/etc/OpenCL
This is mystery number one: why isn't this being populated into Guix's root /etc?
We can point clinfo at the vendors directory with the OPENCL_VENDOR_PATH environment variable. The contents of the mesa.icd file in that directory are:
cat /run/current-system/profile/etc/OpenCL/vendors/mesa.icd
/gnu/store/48qh6x7ky8r1cxbfalwzngch4hgnrrr9-mesa-opencl-icd-21.3.8/lib/libMesaOpenCL.so.1
By running strace, we can see that despite mesa-opencl-icd being installed in the system's profile, the library is still not found:
OPENCL_VENDOR_PATH=/run/current-system/profile/etc/OpenCL/vendors guix environment --ad-hoc strace clinfo -- strace -o/dev/stdout -eopenat clinfo
openat(AT_FDCWD, "/gnu/store/2ax9z25142khhqx61ks767jr758pzq5r-clinfo-3.0.21.02.21/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/gnu/store/i70jq190cpc45crbnrw8g8lgb4djyi9r-opencl-icd-loader-2021.06.30/lib/libOpenCL.so.1", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/gnu/store/5h2w4qi9hk1qzzgi1w83220ydslinr4s-glibc-2.33/lib/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/gnu/store/094bbaq6glba86h1d4cj16xhdi6fk2jl-gcc-10.3.0-lib/lib/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/gnu/store/5h2w4qi9hk1qzzgi1w83220ydslinr4s-glibc-2.33/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/gnu/store/5h2w4qi9hk1qzzgi1w83220ydslinr4s-glibc-2.33/lib/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/etc/OpenCL/vendors", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
Number of platforms 0
+++ exited with 0 +++
It is, however, present:
find -L /run/current-system/profile -name libMesaOpenCL.so.1
/run/current-system/profile/lib/libMesaOpenCL.so.1
This is mystery number two: why can't Guix locate the library?
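Because an ICD file simply names the library that the loader should open, one way to narrow this down is to check, for each .icd file in a vendors directory, whether the path it names exists. This is a minimal sketch under stated assumptions: check_icds is a hypothetical helper name, it reads only the first line of each file, and it can only verify absolute paths. A bare library name is left to the dynamic linker, so it is reported as unchecked.

```shell
# Hypothetical helper: for every *.icd file in a vendors directory, report
# whether the library path on its first line exists.
#   ok        - absolute path that exists
#   missing   - absolute path that does not exist
#   unchecked - not an absolute path (resolved by the dynamic linker instead)
check_icds() {
    vendor_dir=$1
    for icd in "$vendor_dir"/*.icd; do
        [ -e "$icd" ] || continue      # the glob matched nothing
        lib=$(head -n 1 "$icd")
        case $lib in
            /*)
                if [ -e "$lib" ]; then
                    echo "ok $icd -> $lib"
                else
                    echo "missing $icd -> $lib"
                fi
                ;;
            *)
                echo "unchecked $icd -> $lib"
                ;;
        esac
    done
}

# Possible usage against the system profile:
#   check_icds /run/current-system/profile/etc/OpenCL/vendors
```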
If we create our own vendors file, populate it with the location of the libMesaOpenCL.so file, and point clinfo at this, things begin to look better.
cat ${HOME}/.local/etc/OpenCL/vendors/mesa.icd
/run/current-system/profile/lib/libMesaOpenCL.so.1
OPENCL_VENDOR_PATH=${HOME}/.local/etc/OpenCL/vendors clinfo
Number of platforms 0
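The manual workaround above can be scripted. The following is a sketch only: make_vendor_dir is a hypothetical helper name, and the mesa.icd file name and the paths in the usage comment are taken from the commands shown in this post.

```shell
# Hypothetical helper: create a private vendors directory containing a single
# mesa.icd file that names the given library path.
make_vendor_dir() {
    vendor_dir=$1
    lib=$2
    mkdir -p "$vendor_dir" && printf '%s\n' "$lib" > "$vendor_dir/mesa.icd"
}

# Possible usage, mirroring the commands above:
#   make_vendor_dir "${HOME}/.local/etc/OpenCL/vendors" \
#       /run/current-system/profile/lib/libMesaOpenCL.so.1
#   OPENCL_VENDOR_PATH=${HOME}/.local/etc/OpenCL/vendors clinfo
```

Pointing the ICD file at the profile path rather than a /gnu/store path means it should keep working after the system is reconfigured, although that is an assumption about how the profile symlinks are maintained.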
However, Leela Zero is still not working:
OPENCL_VENDOR_PATH=${HOME}/.local/etc/OpenCL/vendors leelaz --tune-only 2>&1 || true
A network weights file is required to use the program. By default, Leela Zero looks for it in /home/katco/.local/share/leela-zero/best-network.
There is a curious error in the output of clinfo:
Preferred work group size multiple <getWGsizes:1200: create kernel : error -46>
If we set the LD_DEBUG environment variable to libs, we can shed some light as to what is wrong:
OPENCL_VENDOR_PATH=${HOME}/.local/etc/OpenCL/vendors LD_DEBUG=libs clinfo 2>&1 |grep error
Indeed, this file is not present.
[ -f /gnu/store/h86b3253bc3mnp3p57n1vls2vkfv2h6z-libclc-9.0.1/share/clc/gfx1010-amdgcn-mesa-mesa3d.bc ]; echo $?
1
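To check other architectures or other libclc builds the same way, the test can be generalized. This is a rough sketch: has_clc_target is a hypothetical helper name, and it assumes bitcode files follow the ARCH-*.bc naming seen in the path above (for example gfx1010-amdgcn-mesa-mesa3d.bc).

```shell
# Hypothetical helper: succeed if a libclc share/clc directory contains at
# least one bitcode file for the given architecture.
# Assumes files are named like "<arch>-<triple>.bc".
has_clc_target() {
    clc_dir=$1
    arch=$2
    for f in "$clc_dir/$arch"-*.bc; do
        [ -f "$f" ] && return 0
    done
    return 1
}

# Possible usage with the store path from above:
#   has_clc_target /gnu/store/h86b3253bc3mnp3p57n1vls2vkfv2h6z-libclc-9.0.1/share/clc gfx1010 \
#       && echo supported || echo unsupported
```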
Further research turned up a bug (44841) against libclc which suggests that while support for my card was added in LLVM v10 (at the time of this writing, LLVM has released v12), libclc does not support my card's architecture, gfx1010.
I attempted to build libclc v12.0.0 locally, but it segfaulted. Building v11.0.0 worked, but as suggested by the open bug, support for my card's architecture still has not been implemented.
I briefly entertained creating a Guix package from AMD's amdgpu-pro packages, but it appears as though my card is not supported, and according to a bug (819) against ROCm, likely won't be.
So it would seem I'm out of luck, and I'm stuck running Leela Zero on the CPU for now. Analyzing one of my games on 60 compute cores took somewhere around ten minutes, so not intractable.
Still, perhaps this helps others running Guix with GPUs that are supported by libclc.
As an aside, this research is perhaps an indication of why, despite my years of interest in Go, I remain a Kyu player.