AWS EC2 TensorFlow GPU not working

【Question title】: aws ec2 tensorflow gpu not work
【Posted】: 2018-08-25 20:44:14
【Question description】:

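I have an AWS EC2 instance (p2.xlarge) running this AMI:

Deep Learning AMI (Ubuntu) Version 5.0 - ami-7336d50e

Comes with the latest deep learning framework binaries pre-installed in separate virtual environments: MXNet, TensorFlow, Caffe, Caffe2, PyTorch, Keras, Chainer, Theano and CNTK. Fully configured with NVidia CUDA, cuDNN and NCCL.

I am trying to build an RNN with Keras. When I launch my program I get this: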

 I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
 I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
 I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
 I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
 I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally

Once Keras starts, I get this:

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.17GiB
Free memory: 11.10GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 12639 get requests, put_count=6277 evicted_count=1000 eviction_rate=0.159312 and unsatisfied allocation rate=0.590395
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110

But the program does not train quickly: my MacBook Pro is faster than my EC2 instance, and after every epoch I get this warning:

tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 4156 get requests, put_count=8233 evicted_count=4000 eviction_rate=0.48585 and unsatisfied allocation rate=0.000481232
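
For reading these PoolAllocator lines, a minimal sketch in Python follows. It only parses the text of the message; the helper name `parse_pool_allocator` and the returned field names are hypothetical, chosen for illustration. For the two lines quoted in the question, the printed `eviction_rate` matches `evicted_count / put_count`, and multiplying the unsatisfied allocation rate by the number of get requests gives a whole number, which suggests these are simple counters for an internal memory pool. The first occurrence in the log carries the `I` (info) prefix, and the companion line shows TensorFlow raising the pool size limit on its own, so the message by itself is probably not the cause of slow training.

```python
import re

# Pattern for the PoolAllocator message text quoted in the question.
POOL_RE = re.compile(
    r"After (?P<gets>\d+) get requests, put_count=(?P<puts>\d+) "
    r"evicted_count=(?P<evicted>\d+) eviction_rate=(?P<eviction_rate>[0-9.eE+-]+) "
    r"and unsatisfied allocation rate=(?P<unsatisfied_rate>[0-9.eE+-]+)"
)


def parse_pool_allocator(line):
    """Extract the counters from one PoolAllocator log line (hypothetical helper)."""
    match = POOL_RE.search(line)
    if match is None:
        return None
    gets = int(match.group("gets"))
    puts = int(match.group("puts"))
    evicted = int(match.group("evicted"))
    unsatisfied_rate = float(match.group("unsatisfied_rate"))
    return {
        "gets": gets,
        "puts": puts,
        "evicted": evicted,
        "reported_eviction_rate": float(match.group("eviction_rate")),
        # Recomputed from the counters, to compare with the reported value.
        "computed_eviction_rate": evicted / puts if puts else 0.0,
        # Estimated number of get requests the pool could not satisfy.
        "estimated_unsatisfied": round(unsatisfied_rate * gets),
    }


if __name__ == "__main__":
    line = (
        "PoolAllocator: After 4156 get requests, put_count=8233 "
        "evicted_count=4000 eviction_rate=0.48585 "
        "and unsatisfied allocation rate=0.000481232"
    )
    print(parse_pool_allocator(line))
```

Read this way, the per-epoch line reports about 2 unsatisfied requests out of 4156, compared with about 7462 out of 12639 in the startup line.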

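I have installed keras_gpu and tensorflow_gpu, and I am using the virtual environment for Keras 2 with TensorFlow.

Can you tell me what I am doing wrong, such that a small, simple MacBook can be faster than an EC2 instance with this spec?

p2.xlarge (11.75 ECU, 4 vCPU, 2.7 GHz, E5-2686v4, 61 GiB memory, EBS only)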
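
One way to confirm that the active virtual environment actually exposes the GPU to TensorFlow is to list the devices it can see. The sketch below uses `device_lib.list_local_devices()` from `tensorflow.python.client`; the wrapper `gpu_device_names` is a hypothetical helper, and returning an empty list when TensorFlow cannot be imported is an assumption made so the script still runs in an environment without it. The log above already reports a Tesla K80 as `/gpu:0`, so on this instance the list would be expected to be non-empty.

```python
def gpu_device_names():
    """Return the names of GPU devices visible to TensorFlow.

    Hypothetical helper: returns [] if TensorFlow is not importable
    in the current environment.
    """
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return []
    return [
        device.name
        for device in device_lib.list_local_devices()
        if device.device_type == "GPU"
    ]


if __name__ == "__main__":
    names = gpu_device_names()
    print("GPUs visible to TensorFlow:", names if names else "none")
```

If the list is empty inside the Keras environment, the CPU-only TensorFlow package may be the one being imported.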

【Comments on the question】:

Tags: tensorflow amazon-ec2 keras

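【Solution 1】:

The answer is simple. On the EC2 instance (p2.xlarge) the GPU is a Tesla K80, which in TensorFlow gives roughly a 4x to 10x speedup over a CPU, while my MacBook has 8 CPUs. With that many cores, the MacBook can match or beat the single K80 on this workload.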
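
The arithmetic behind this can be sketched as follows. The 4x to 10x range and the 8 CPUs come from the answer; reading the speedup as relative to a single CPU core, and the `scaling_efficiency` parameter (1.0 means the work scales perfectly across cores), are assumptions added for illustration, as is the function name `gpu_vs_multicore`. Real RNN training rarely scales perfectly on either device, so this is a rough estimate rather than a benchmark.

```python
def gpu_vs_multicore(gpu_speedup_over_one_core, cpu_cores, scaling_efficiency=1.0):
    """Ratio of GPU throughput to multi-core CPU throughput.

    Values below 1.0 mean the multi-core CPU is estimated to be faster.
    Assumes (hypothetically) that CPU throughput grows linearly with
    cpu_cores * scaling_efficiency.
    """
    cpu_speedup = cpu_cores * scaling_efficiency
    return gpu_speedup_over_one_core / cpu_speedup


if __name__ == "__main__":
    for gpu_speedup in (4, 10):
        ratio = gpu_vs_multicore(gpu_speedup, cpu_cores=8)
        print(f"GPU {gpu_speedup}x one core vs 8 cores: ratio {ratio:.2f}")
```

Under these assumptions the K80 lands between 0.5x and 1.25x of the 8-core machine, which is consistent with the MacBook sometimes winning.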

【Comments】: