Use GPU on python docker image

【Question title】: Use GPU on python docker image
【Posted】: 2021-04-17 13:16:04
【Question】:

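I am using the python:3.7.4-slim-buster docker image and I can't change it. I would like to know how to use my NVIDIA GPUs on it.

I usually use tensorflow/tensorflow:1.14.0-gpu-py3, and with a simple --runtime=nvidia in the docker run command everything works fine, but now I have this constraint.

I don't think a shortcut exists for this type of image, so I followed this guide https://towardsdatascience.com/how-to-properly-use-the-gpu-within-a-docker-container-4c699c78c6d1 and built the Dockerfile it suggests: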

FROM python:3.7.4-slim-buster

RUN apt-get update && apt-get install -y build-essential
RUN apt-get --purge remove -y nvidia*
# Get the install files you used to install CUDA and the NVIDIA drivers on your host.
ADD ./Downloads/nvidia_installers /tmp/nvidia
# Install the driver.
RUN /tmp/nvidia/NVIDIA-Linux-x86_64-331.62.run -s -N --no-kernel-module
# For some reason the driver installer left temp files when used during a docker build (I don't have any explanation why), and the CUDA installer will fail if they are still there, so we delete them.
RUN rm -rf /tmp/selfgz7
# CUDA driver installer.
RUN /tmp/nvidia/cuda-linux64-rel-6.0.37-18176142.run -noprompt
# CUDA samples; comment this out if you don't want them.
RUN /tmp/nvidia/cuda-samples-linux-6.0.37-18176142.run -noprompt -cudaprefix=/usr/local/cuda-6.0
# Add the CUDA library to your PATH.
RUN export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
# Update the ld.so.conf.d directory.
RUN touch /etc/ld.so.conf.d/cuda.conf
# Delete installer files.
RUN rm -rf /temp/*

But it raises an error:

ADD failed: stat /var/lib/docker/tmp/docker-builder080208872/Downloads/nvidia_installers: no such file or directory

What can I change to easily let the docker image see my GPUs?

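【Comments】:

Base your Dockerfile on an nvidia-docker image. You need the CUDA drivers installed on the host.
You can't change python:3.7.4-slim-buster because you need this specific Python version, right?
@anemyte Yes, or at least I shouldn't use the tensorflow/pytorch images. It would be best to keep this one. Why? Do you have an idea?
I have an Ubuntu-based Dockerfile with python3.7 and CUDA. I use it to create a custom TensorFlow image for my needs. If that suits your needs, I'll post it as an answer.
@sim Thank you. Can you give me a Dockerfile example that keeps my image unchanged?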

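Tags: python docker dockerfile gpu nvidia

【Solution 1】:

The TensorFlow images are split into several "partial" Dockerfiles. One of them contains all the dependencies TensorFlow needs to run on a GPU. Using it, you can easily create a custom image; you only need to change the default python to whatever version you need. To me this seems much easier than bringing NVIDIA's stuff into a Debian image (which, AFAIK, is not officially supported for CUDA and/or cuDNN).

Here is the Dockerfile: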

# TensorFlow image base written by TensorFlow authors.
# Source: https://github.com/tensorflow/tensorflow/blob/v2.3.0/tensorflow/tools/dockerfiles/partials/ubuntu/nvidia.partial.Dockerfile
# -------------------------------------------------------------------------
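# UBUNTU_VERSION is referenced by the FROM line but not defined in the snippet
# as posted; 18.04 is an assumed value, adjust it if you need another release.
ARG UBUNTU_VERSION=18.04
ARG ARCH=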
ARG CUDA=10.1
FROM nvidia/cuda${ARCH:+-$ARCH}:${CUDA}-base-ubuntu${UBUNTU_VERSION} as base
# ARCH and CUDA are specified again because the FROM directive resets ARGs
# (but their default value is retained if set previously)
ARG ARCH
ARG CUDA
ARG CUDNN=7.6.4.38-1
ARG CUDNN_MAJOR_VERSION=7
ARG LIB_DIR_PREFIX=x86_64
ARG LIBNVINFER=6.0.1-1
ARG LIBNVINFER_MAJOR_VERSION=6

# Needed for string substitution
SHELL ["/bin/bash", "-c"]
# Pick up some TF dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        cuda-command-line-tools-${CUDA/./-} \
        # There appears to be a regression in libcublas10=10.2.2.89-1 which
        # prevents cublas from initializing in TF. See
        # https://github.com/tensorflow/tensorflow/issues/9489#issuecomment-562394257
        libcublas10=10.2.1.243-1 \
        cuda-nvrtc-${CUDA/./-} \
        cuda-cufft-${CUDA/./-} \
        cuda-curand-${CUDA/./-} \
        cuda-cusolver-${CUDA/./-} \
        cuda-cusparse-${CUDA/./-} \
        curl \
        libcudnn7=${CUDNN}+cuda${CUDA} \
        libfreetype6-dev \
        libhdf5-serial-dev \
        libzmq3-dev \
        pkg-config \
        software-properties-common \
        unzip

# Install TensorRT if not building for PowerPC
RUN [[ "${ARCH}" = "ppc64le" ]] || { apt-get update && \
        apt-get install -y --no-install-recommends libnvinfer${LIBNVINFER_MAJOR_VERSION}=${LIBNVINFER}+cuda${CUDA} \
        libnvinfer-plugin${LIBNVINFER_MAJOR_VERSION}=${LIBNVINFER}+cuda${CUDA} \
        && apt-get clean \
        && rm -rf /var/lib/apt/lists/*; }

# For CUDA profiling, TensorFlow requires CUPTI.
ENV LD_LIBRARY_PATH /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Link the libcuda stub to the location where tensorflow is searching for it and reconfigure
# dynamic linker run-time bindings
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1 \
    && echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/z-cuda-stubs.conf \
    && ldconfig
# -------------------------------------------------------------------------
#
# Custom part
FROM base
ARG PYTHON_VERSION=3.7

RUN apt-get update && apt-get install -y --no-install-recommends --no-install-suggests \
          python${PYTHON_VERSION} \
          python3-pip \
          python${PYTHON_VERSION}-dev \
# Change default python
    && cd /usr/bin \
    && ln -sf python${PYTHON_VERSION}         python3 \
    && ln -sf python${PYTHON_VERSION}m        python3m \
    && ln -sf python${PYTHON_VERSION}-config  python3-config \
    && ln -sf python${PYTHON_VERSION}m-config python3m-config \
    && ln -sf python3                         /usr/bin/python \
# Update pip and add common packages
    && python -m pip install --upgrade pip \
    && python -m pip install --upgrade \
        setuptools \
        wheel \
        six \
# Cleanup
    && apt-get clean \
    && rm -rf $HOME/.cache/pip

You can take it from here: change the Python version to the one you need (and that is available in the Ubuntu repositories), add packages, code, etc.
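
Once the image is built and a container is started with GPU access (for example with the --runtime=nvidia flag mentioned in the question), a quick check from Python can confirm that the driver is actually reachable. The following is only a minimal sketch and not part of the original answer: the function name cuda_device_count is hypothetical, and it assumes the driver library is exposed inside the container as libcuda.so.1. It relies on the CUDA driver API calls cuInit and cuDeviceGetCount through ctypes, so no extra packages are needed.

```python
import ctypes


def cuda_device_count(library="libcuda.so.1"):
    """Return the number of CUDA devices visible to this process, or None.

    None means the driver library could not be loaded or initialised,
    which is what to expect in a container started without GPU access.
    The default library name is an assumption about the runtime setup.
    """
    try:
        libcuda = ctypes.CDLL(library)
        cu_init = libcuda.cuInit
        cu_device_get_count = libcuda.cuDeviceGetCount
    except (OSError, AttributeError):
        # Library missing, or it does not export the driver API symbols.
        return None

    # cuInit(0) must succeed before any other driver API call.
    if cu_init(0) != 0:
        return None

    count = ctypes.c_int(0)
    if cu_device_get_count(ctypes.byref(count)) != 0:
        return None
    return count.value


if __name__ == "__main__":
    n = cuda_device_count()
    if n is None:
        print("CUDA driver not usable in this container")
    else:
        print(f"{n} CUDA device(s) visible")
```

Because the Dockerfile above links a libcuda stub into the library path, simply loading the library is not a reliable signal. The sketch therefore treats any non-zero return code from cuInit as "no usable driver".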

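【Comments】:

This was a great help for my solution. The only thing I changed is that I copied the official python Dockerfile code for the Python part. Note that some other GPU-using libraries, such as the implicit library, need the "devel" version of the TensorFlow Dockerfile (rather than the base version above) in order to run.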