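【Posted】: 2019-11-12 23:07:00
【Question】:
docker stack deploy with GPU, but no NVIDIA device is found
Description:
When I start the program with docker-compose up, the code runs fine! But when I start it with docker stack deploy -c docker-compose.yml test, it cannot find any visible NVIDIA device. My docker-compose.yml and the error log are shown below.
I am confused: the configuration is identical, yet starting with docker-compose up works while starting with docker stack deploy -c docker-compose.yml test does not. Is GPU support in docker swarm still incomplete, or is there another approach I have not found?
Environment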
docker version: 18.06.0-ce
NVIDIA Docker: 1.0.1
Ubuntu: 16.04
/etc/docker/daemon.json
Of course, I edited /etc/docker/daemon.json to change the default runtime, and then restarted Docker.
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
sudo systemctl daemon-reload
sudo systemctl start docker
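To rule out a typo in the daemon configuration, the file can be checked programmatically. This is only a minimal sketch: the function name default_runtime_is_nvidia and the inline SAMPLE string are illustrative assumptions, and in practice the text would be read from /etc/docker/daemon.json.

```python
import json

# SAMPLE mirrors the daemon.json shown above; it is inlined here only so the
# sketch runs without touching the real file.
SAMPLE = """
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
"""

def default_runtime_is_nvidia(daemon_json_text):
    """Return True if the config names 'nvidia' as default runtime and defines it."""
    config = json.loads(daemon_json_text)
    runtimes = config.get("runtimes", {})
    return config.get("default-runtime") == "nvidia" and "nvidia" in runtimes

print(default_runtime_is_nvidia(SAMPLE))
```

A passing check only shows that the file is well formed; whether the running daemon picked it up is a separate question (docker info reports the default runtime it is actually using).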
docker-compose.yml configuration file
version: "3"
volumes:
  nvidia_driver_430.14:
    external: true
services:
  tts-server:
    build:
      context: ./
      dockerfile: ./docker/tts_server/Dockerfile
    deploy:
      replicas: 1
    image: tts-system/tts-server-gpu
    environment:
      NVIDIA_VISIBLE_DEVICES: 0
    devices:
      - /dev/nvidia0
      - /dev/nvidiactl
      - /dev/nvidia-uvm
    volumes:
      - ./models:/tts_system/models:ro
      - ./config:/tts_system/config:ro
      - nvidia_driver_430.14:/usr/local/nvidia:ro
    networks:
      - overlay
    ports:
      - "9091:9090"
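One thing worth checking when the same file behaves differently under the two commands: as far as I know, the Compose file version 3 reference marks several service-level keys, including build and devices, as ignored when a stack is deployed in swarm mode. The sketch below flags such keys in an already-parsed services mapping. The IGNORED_IN_SWARM set is a partial list written from memory and should be verified against the reference for your Docker version; the function name ignored_keys is hypothetical.

```python
# Partial list of service-level keys that the Compose file v3 reference
# documents as ignored by `docker stack deploy` (assumption: verify for your version).
IGNORED_IN_SWARM = {"build", "container_name", "devices", "links", "network_mode", "restart"}

def ignored_keys(services):
    """Map each service name to the sorted keys that swarm mode would ignore."""
    report = {}
    for name, cfg in services.items():
        hits = sorted(IGNORED_IN_SWARM & set(cfg))
        if hits:
            report[name] = hits
    return report

# The services section above, written out as a plain dict so that no YAML
# parser is needed for the sketch.
services = {
    "tts-server": {
        "build": {"context": "./", "dockerfile": "./docker/tts_server/Dockerfile"},
        "deploy": {"replicas": 1},
        "image": "tts-system/tts-server-gpu",
        "environment": {"NVIDIA_VISIBLE_DEVICES": 0},
        "devices": ["/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"],
        "networks": ["overlay"],
        "ports": ["9091:9090"],
    }
}

print(ignored_keys(services))  # {'tts-server': ['build', 'devices']}
```

If devices is indeed dropped by stack deploy, that would be consistent with /dev/nvidia0 being absent inside the service container, though this alone does not prove it is the cause.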
Program error log
2019-07-02 07:50:24.805114: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499885000 Hz
2019-07-02 07:50:24.808418: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4112870 executing computations on platform Host. Devices:
2019-07-02 07:50:24.808457: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-07-02 07:50:24.811640: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-07-02 07:50:24.811684: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:155] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
E0702 07:50:24.811846 1 decoder.cc:80] Filed to create session: Invalid argument: 'visible_device_list' listed an invalid GPU id '0' but visible device count is -1
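The log says /dev/nvidia0 does not exist inside the container. A quick way to compare the two launch methods is to run the same small check in each container and see which device nodes and which NVIDIA_VISIBLE_DEVICES value it reports. This is a minimal sketch; the function name gpu_visibility_report and its parameters are illustrative assumptions.

```python
import os

def gpu_visibility_report(dev_dir="/dev", environ=None):
    """List nvidia* entries under dev_dir and report NVIDIA_VISIBLE_DEVICES."""
    environ = os.environ if environ is None else environ
    try:
        entries = os.listdir(dev_dir)
    except OSError:
        # Directory missing or unreadable: treat as "no device nodes".
        entries = []
    nodes = sorted(e for e in entries if e.startswith("nvidia"))
    return {
        "device_nodes": nodes,
        "NVIDIA_VISIBLE_DEVICES": environ.get("NVIDIA_VISIBLE_DEVICES"),
    }

print(gpu_visibility_report())
```

Running it under docker-compose up and again under docker stack deploy should make it clear whether the device nodes, the environment variable, or both differ between the two.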
This problem has been bothering me for a long time. Any help is much appreciated.
【Discussion】:
Tags: docker deployment stack gpu