GPU not detected by Keras/Tensorflow on AWS ml.p2.xlarge instance managed by SageMaker

【Question title】: GPU not detected by Keras/Tensorflow on AWS ml.p2.xlarge instance managed by SageMaker
【Posted】: 2020-05-25 05:14:16
【Question】:

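I am running a custom Docker container with SageMaker on an ml.p2.xlarge instance.

The base image is tiangolo/python-machine-learning:cuda9.1-python3.7, which normally ships with the required CUDA toolkit. The Python packages are installed via conda, using the following minimal environment.yaml: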

dependencies:
  - boto3
  - joblib
  - keras
  - numpy
  - pandas
  - scikit-learn
  - scipy
  - tensorflow=2.0

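However, when I run a training job for a small LeNet-5 CNN, I see no GPU activity in the logs (and training takes as long as it does on a non-GPU instance).

More worryingly, len(tf.config.experimental.list_physical_devices('GPU')) returns 0, and K.tensorflow_backend._get_available_gpus() is empty. Finally, if I check device placement (using tf.debugging.set_log_device_placement(True)) on a basic operation such as the following: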

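import tensorflow as tf

tf.debugging.set_log_device_placement(True)  # log the device each op runs on

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)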

I get:

Executing op _MklMatMul in device /job:localhost/replica:0/task:0/device:CPU:0

which confirms that the operation ran on the CPU.
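When a training job emits many such lines, it can help to extract the op name and device from each one. The following is a minimal sketch, assuming the log format shown above; `PLACEMENT_RE` and `parse_placement` are hypothetical helper names, not part of TensorFlow. Note also that, as far as I know, an op name starting with `_Mkl` comes from a TensorFlow build with Intel MKL optimizations enabled, which would hint that a CPU-oriented build is installed.

```python
import re

# Matches lines emitted when tf.debugging.set_log_device_placement(True) is on, e.g.
# "Executing op _MklMatMul in device /job:localhost/replica:0/task:0/device:CPU:0"
PLACEMENT_RE = re.compile(
    r"Executing op (?P<op>\S+) in device \S*/device:(?P<kind>[A-Z_]+):(?P<index>\d+)"
)


def parse_placement(line):
    """Return (op, device_kind, device_index), or None if the line does not match."""
    match = PLACEMENT_RE.search(line)
    if match is None:
        return None
    return match.group("op"), match.group("kind"), int(match.group("index"))


line = "Executing op _MklMatMul in device /job:localhost/replica:0/task:0/device:CPU:0"
op, kind, index = parse_placement(line)
print(op, kind, index)  # _MklMatMul CPU 0
```

If every parsed line reports the CPU device kind, no op was placed on a GPU.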

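At first I thought my use case was too light to trigger GPU usage, but it seems the GPU is not detected at all! Am I missing a step or component needed for this to work?

【Comments】:

Tags: keras gpu amazon-sagemaker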

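【Solution 1】:

You need to install all the necessary CUDA libraries and related GPU packages for this to work. Consider, for example, the following step from the SageMaker TensorFlow Dockerfile: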

RUN apt-get update \
 && apt-get install -y --no-install-recommends --allow-unauthenticated \
    python3-dev \
    python3-pip \
    python3-setuptools \
    ca-certificates \
    cuda-command-line-tools-10-0 \
    cuda-cublas-dev-10-0 \
    cuda-cudart-dev-10-0 \
    cuda-cufft-dev-10-0 \
    cuda-curand-dev-10-0 \
    cuda-cusolver-dev-10-0 \
    cuda-cusparse-dev-10-0 \
    curl \
    libcudnn7=7.5.1.10-1+cuda10.0 \
    # TensorFlow doesn't require libnccl anymore but Open MPI still depends on it
    libnccl2=2.4.7-1+cuda10.0 \
    libgomp1 \
    libnccl-dev=2.4.7-1+cuda10.0 \
    libfreetype6-dev \
    libhdf5-serial-dev \
    libpng-dev \
    libzmq3-dev \
    git \
    wget \
    vim \
    build-essential \
    openssh-client \
    openssh-server \
    zlib1g-dev \
    # The 'apt-get install' of nvinfer-runtime-trt-repo-ubuntu1804-5.0.2-ga-cuda10.0
    # adds a new list which contains libnvinfer library, so it needs another
    # 'apt-get update' to retrieve that list before it can actually install the
    # library.
    # We don't install libnvinfer-dev since we don't need to build against TensorRT,
    # and libnvinfer4 doesn't contain libnvinfer.a static library.
 && apt-get update && apt-get install -y --no-install-recommends --allow-unauthenticated  \
    nvinfer-runtime-trt-repo-ubuntu1804-5.0.2-ga-cuda10.0 \
 && apt-get update && apt-get install -y --no-install-recommends --allow-unauthenticated  \
    libnvinfer5=5.0.2-1+cuda10.0 \
 && rm /usr/lib/x86_64-linux-gnu/libnvinfer_plugin* \
 && rm /usr/lib/x86_64-linux-gnu/libnvcaffe_parser* \
 && rm /usr/lib/x86_64-linux-gnu/libnvparsers* \
 && rm -rf /var/lib/apt/lists/* \
 && mkdir -p /var/run/sshd

Run all of the commands above, then check again whether the GPU is detected.
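One way to check the result from inside the container is a small probe script. This is only a sketch under assumptions: `probe_gpu_stack` is a hypothetical helper, and `ctypes.util.find_library` may miss libraries that live outside the standard linker search paths, so a `False` here is a hint rather than proof. The TensorFlow calls used (`tf.test.is_built_with_cuda` and `tf.config.experimental.list_physical_devices`) are available in TensorFlow 2.0.

```python
import ctypes.util
import shutil


def probe_gpu_stack():
    """Collect basic facts about the GPU software stack visible in this environment."""
    report = {
        # nvidia-smi is normally exposed to the container by the NVIDIA runtime
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
        # libcuda is the driver library; cudart and cudnn are user-space libraries
        "libcuda": ctypes.util.find_library("cuda") is not None,
        "libcudart": ctypes.util.find_library("cudart") is not None,
        "libcudnn": ctypes.util.find_library("cudnn") is not None,
    }
    try:
        import tensorflow as tf
    except ImportError:
        report["tensorflow"] = None  # TensorFlow is not installed here
        return report
    report["tensorflow"] = tf.__version__
    # False means the installed TensorFlow is a CPU-only build
    report["built_with_cuda"] = tf.test.is_built_with_cuda()
    report["gpus"] = len(tf.config.experimental.list_physical_devices("GPU"))
    return report


if __name__ == "__main__":
    for key, value in probe_gpu_stack().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is `False`, installing more CUDA packages will not help until a GPU build of TensorFlow is installed. If it is `True` but `gpus` is 0, the library versions or the driver visibility are the more likely culprits.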

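【Comments】:

【Solution 2】:

I recommend starting from the environments that SageMaker provides, so that you have a tested, up-to-date, production-ready setup. For TensorFlow and Keras specifically: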

On SageMaker Notebooks, the conda_tensorflow_p* Jupyter kernels
For SageMaker training and inference jobs, the TensorFlow framework container (container on GitHub, orchestration with the Python SDK)
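For the second option, a training job launched through the Python SDK might look roughly like the sketch below. It is not taken from the thread: it assumes version 2 of the SageMaker Python SDK (version 1 used `train_instance_count` and `train_instance_type` instead), and the script name, the TensorFlow version, and the helper names `training_job_settings` and `launch` are placeholders.

```python
def training_job_settings(role_arn, entry_point="train.py"):
    """Keyword arguments for sagemaker.tensorflow.TensorFlow (SDK v2 parameter names)."""
    return {
        "entry_point": entry_point,       # hypothetical training script
        "role": role_arn,                 # IAM role that SageMaker assumes
        "instance_count": 1,
        "instance_type": "ml.p2.xlarge",  # GPU instance type from the question
        "framework_version": "2.1.0",     # assumed TensorFlow version
        "py_version": "py3",
    }


def launch(role_arn, s3_input):
    """Start a training job; requires the sagemaker package and AWS credentials."""
    # Imported lazily so the settings above can be inspected without the SDK installed.
    from sagemaker.tensorflow import TensorFlow

    estimator = TensorFlow(**training_job_settings(role_arn))
    estimator.fit(s3_input)  # s3_input is an S3 URI pointing at the training data
    return estimator
```

With a GPU instance type, the SDK selects the matching GPU framework image, so no custom Dockerfile is needed in this path.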

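【Comments】:

Thanks. In a SageMaker notebook instance, the GPU is indeed detected. For standalone training jobs, however, I built a container from the SageMaker TensorFlow container you linked to, specified the GPU version, and supplied the tensorflow-gpu 2.1.0 wheel, so everything is GPU, but to no avail!
Could you try the SageMaker SDK? It automatically uses the correct Docker image behind the scenes: sagemaker.readthedocs.io/en/stable/…
I will give it a try, but the original goal was to run training jobs from the dashboard without any programmatic intervention, which requires me to build the right image myself :)
Which dashboard do you mean? The Amazon SageMaker console?
Yes, sorry for not being clear.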