Nvidia & Docker – Failed to initialize NVML

This one is not going to be crazy technical, but I haven’t written in a while and thought this might be helpful to many out there. After containerizing the GPU workloads in my home lab, I noticed that, seemingly at random, my GPU-enabled containers would start throwing a “Failed to initialize NVML” error (e.g. when running nvidia-smi). As it turns out, dozens of people have been reporting the issue over the past couple of years, but there was never a solid solution. It was pointed out that it wasn’t “random” after all: running systemctl daemon-reload would instantly trigger the issue, and the affected container(s) had to be restarted before the GPU(s) could be used within them once again. It all started with Docker 20.10, which started using cgroup v2 (aka unified cgroup hierarchy) when it is enabled on the operating system, with the systemd cgroup driver as the default instead of cgroupfs, and as distros started enabling it by default (e.g. […]
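The symptom and the restart workaround described above can be sketched as a small check. This is only an illustration, not a fix for the underlying cause: it assumes the Docker CLI is on the PATH, that nvidia-smi is available inside the container, and the container name passed in is hypothetical. The error text matched is the one from this post's title.

```python
import subprocess

# Error text taken from the post's title; the real message may carry a suffix.
NVML_ERROR = "Failed to initialize NVML"


def has_nvml_error(output: str) -> bool:
    """Return True if nvidia-smi output contains the NVML initialization error."""
    return NVML_ERROR in output


def check_and_restart(container: str, run=subprocess.run) -> bool:
    """Run nvidia-smi inside a container and restart the container if NVML failed.

    `container` is a hypothetical container name or ID. `run` defaults to
    subprocess.run and can be swapped out for testing.
    Returns True if a restart was issued, False otherwise.
    """
    result = run(
        ["docker", "exec", container, "nvidia-smi"],
        capture_output=True,
        text=True,
    )
    # Check both streams, since the error may be printed on either one.
    if has_nvml_error(result.stdout + result.stderr):
        run(["docker", "restart", container], check=True)
        return True
    return False
```

Calling `check_and_restart("my-gpu-container")` right after a `systemctl daemon-reload` would be one way to confirm the behaviour on an affected host; the container name here is a placeholder.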