TensorFlow new Op CUDA kernel memory management: answers

【Question title】: TensorFlow new Op CUDA kernel memory management
【Posted】: 2018-07-12 19:56:06
【Question】:

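I have implemented a fairly complex new Op in TensorFlow with a GPU CUDA kernel. The Op needs a lot of dynamically allocated variables that are not tensors and that are freed once the Op is done. More specifically, it involves a hash table.

Right now I am using cudaMalloc() and cudaFree(), but I noticed that TensorFlow has its own type, Eigen::GPUDevice, which can allocate and free memory on the GPU.

My questions:

1. Is it best practice to manage GPU memory with Eigen::GPUDevice?
2. By using Eigen::GPUDevice instead of the CUDA API, do I "automatically" enable multi-GPU support, since different GPUDevices can be passed to the Op?
3. Should I extend this idea to the CPU kernel and check whether there is a CPUDevice type that also manages memory, instead of using C++ syntax (i.e. auto var = new int[100]; delete[] var)?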

【Comments】:

Tags: tensorflow gpu

【Answer 1】:

There is no direct public guideline for this issue. I usually just let TensorFlow allocate this memory via

template <typename Device, typename Dtype>
class MyOp : public OpKernel {
 public:
  explicit MyOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
    // ...
  }

  void Compute(OpKernelContext* ctx) override {
    Tensor tmp_var;            // allocate_temp fills in a Tensor object
    Tensor* output = nullptr;  // allocate_output hands back a pointer

    TensorShape some_shape, some_shape2;

    // temporarily use this space
    OP_REQUIRES_OK(ctx, ctx->allocate_temp(DT_FLOAT, some_shape, &tmp_var));
    // allocate memory for output tensor
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, some_shape2, &output));
    // ...
  }
};

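Anything that needs memory should be allocated by the TensorFlow context, not by custom cudaMalloc or new type[num] calls.
The context should provide the information for the allocator.
See below.

For simplicity, consider just adding two matrices (full example). TensorFlow operations usually have the following structure:

Op description with REGISTER_OP, which is responsible for checking shapes and setting the output shape (example)

OpKernel, which is responsible for allocating memory, getting pointers to the inputs, and setting things up (see above or this)

Functor for the implementation itself, e.g.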

Tensor* output = nullptr;
Tensor tmp_var;
OP_REQUIRES_OK(ctx, ctx->allocate_output(0, output_shape, &output));
// allocate_temp expects a DataType as its first argument, not an output index
OP_REQUIRES_OK(ctx, ctx->allocate_temp(DataTypeToEnum<Dtype>::value, some_shape, &tmp_var));
// the functor does not need to care about memory allocation, as everything is already set up at this point
::tensorflow::functor::MyFunctor<Device, Dtype>()(ctx, inputA, inputB, &tmp_var, output);

All that is left is the implementation itself:

    // gpu version
    template <typename Dtype>
    struct MyFunctor<GPUDevice, Dtype> {
      void operator()(::tensorflow::OpKernelContext* ctx, ...) {
        // launch the CUDA kernel on the pre-allocated buffers
      }
    };

    // cpu version
    template <typename Dtype>
    struct MyFunctor<CPUDevice, Dtype> {
      void operator()(::tensorflow::OpKernelContext* ctx, ...) {
        // plain C++ loop over the pre-allocated buffers
      }
    };
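The per-device functor pattern above can be tried out without TensorFlow. The following is a minimal standalone sketch under stated assumptions: `CPUDevice` and `GPUDevice` are empty hypothetical tag types here (in a TensorFlow op they are typically aliases for Eigen's device types), and `AddFunctor` and `AddVectors` are names invented for illustration. Only the CPU specialization is written out; a GPU one would live in a .cu file and launch a kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical device tags used only to select a specialization.
struct CPUDevice {};
struct GPUDevice {};

// Primary template: declared only; each device supplies its own specialization.
template <typename Device, typename Dtype>
struct AddFunctor;

// CPU version: a plain loop over buffers that the caller already allocated.
template <typename Dtype>
struct AddFunctor<CPUDevice, Dtype> {
  void operator()(const Dtype* a, const Dtype* b, Dtype* out,
                  std::size_t n) const {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
  }
};

// Caller: owns all allocation, then dispatches on Device at compile time.
template <typename Device, typename Dtype>
std::vector<Dtype> AddVectors(const std::vector<Dtype>& a,
                              const std::vector<Dtype>& b) {
  assert(a.size() == b.size());
  std::vector<Dtype> out(a.size());
  AddFunctor<Device, Dtype>()(a.data(), b.data(), out.data(), out.size());
  return out;
}
```

Calling `AddVectors<CPUDevice, float>(a, b)` picks the CPU specialization; the functor itself never allocates, which mirrors the split between OpKernel and Functor described above.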

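Edit

allocate_persistent: use this if you need data to survive between Op invocations, such as a one-time index structure. [example]
allocate_temp: just temporary memory that is not retained once the Compute method returns. [example]

But I strongly recommend reading the comments in the source-code here and then deciding based on your use case.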
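The lifetime difference between the two kinds of allocation can be sketched in plain C++. This is a toy model, not TensorFlow code: `ToyOp` is a hypothetical class, its member `table_` plays the role of persistent memory that survives between calls, and the local `scratch` vector plays the role of temporary memory that is gone when `Compute` returns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical op that weights each input element by a lookup table.
class ToyOp {
 public:
  explicit ToyOp(std::size_t n) : n_(n) {}

  long Compute(const std::vector<int>& input) {
    assert(input.size() == n_);
    if (!built_) {
      // "Persistent" state: built on the first call, reused afterwards.
      table_.resize(n_);
      for (std::size_t i = 0; i < n_; ++i) table_[i] = static_cast<int>(i) + 1;
      built_ = true;
      ++builds_;
    }
    // "Temp" state: lives only for the duration of this call.
    std::vector<long> scratch(n_);
    for (std::size_t i = 0; i < n_; ++i) {
      scratch[i] = static_cast<long>(input[i]) * table_[i];
    }
    long total = 0;
    for (long v : scratch) total += v;
    return total;
  }

  // How many times the persistent table was built.
  int builds() const { return builds_; }

 private:
  std::size_t n_;
  std::vector<int> table_;
  bool built_ = false;
  int builds_ = 0;
};
```

Calling `Compute` twice on the same `ToyOp` builds the table once, while the scratch buffer is allocated and released on every call.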

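【Comments】:

Thanks for the answer and all the examples. Can you comment on why you use allocate_temp() rather than allocate_persistent(), as suggested in mrry's answer?
It depends on your use case and on whether you want the memory to be released. See the comments in the TensorFlow repo. If the Op is stateless (most Ops are), I recommend using allocate_output in the kernel. GPU memory is a scarce resource (for most people), so I usually free my stuff.
Sorry to bother you again, but I have some follow-up questions I hope you can help with. 1. How do you allocate user-defined C++ types this way? For example, an array of structs. 2. What about static memory? int a[3] = {1,2,3}: should I care about this?
I don't allocate structs this way. But I would probably serialize these things into a byte array. The last example is placed on the stack, so it should usually be fine.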
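The suggestion to serialize structs into a byte array could look roughly like the sketch below. It is standalone C++ and makes several assumptions: `Entry`, `PackEntries` and `UnpackEntries` are hypothetical names, the struct is trivially copyable, and in a real op the byte buffer would presumably be the storage of a uint8 temp tensor rather than a `std::vector`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical record type, e.g. one slot of a hash table.
struct Entry {
  std::int32_t key;
  float value;
};

// memcpy-based packing is only valid for trivially copyable types.
static_assert(std::is_trivially_copyable<Entry>::value,
              "Entry must be trivially copyable to be packed as raw bytes");

// Copy the structs into a flat byte buffer.
std::vector<std::uint8_t> PackEntries(const std::vector<Entry>& entries) {
  std::vector<std::uint8_t> bytes(entries.size() * sizeof(Entry));
  if (!bytes.empty()) {
    std::memcpy(bytes.data(), entries.data(), bytes.size());
  }
  return bytes;
}

// Reinterpret a flat byte buffer as structs again.
std::vector<Entry> UnpackEntries(const std::vector<std::uint8_t>& bytes) {
  assert(bytes.size() % sizeof(Entry) == 0);
  std::vector<Entry> entries(bytes.size() / sizeof(Entry));
  if (!bytes.empty()) {
    std::memcpy(entries.data(), bytes.data(), bytes.size());
  }
  return entries;
}
```

The byte count needed for N structs is `N * sizeof(Entry)`, which is the size one would request for the backing buffer.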

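【Answer 2】:

The best practice is to use the OpKernelContext::allocate_persistent() method to allocate memory, in the form of a tensorflow::Tensor, that outlives a single call to OpKernel::Compute(). It uses the appropriate Allocator* for the device, so if the kernel runs on a GPU device it allocates GPU memory for that particular device, and if it runs on a CPU device it allocates CPU memory.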
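Why going through the context makes the kernel device-agnostic can be shown with a toy model. Nothing below is TensorFlow's API: `ToyAllocator`, `HostAllocator`, `ScratchBuffer`, `ToyContext` and `SumFirstN` are hypothetical names, and only a host allocator is implemented. The kernel body asks the context for memory and never calls malloc or cudaMalloc itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical per-device allocator interface.
struct ToyAllocator {
  virtual ~ToyAllocator() = default;
  virtual void* Allocate(std::size_t bytes) = 0;
  virtual void Deallocate(void* ptr) = 0;
};

// Host implementation; a GPU build would add a sibling that draws from
// that device's memory pool instead of malloc/free.
struct HostAllocator : ToyAllocator {
  int live = 0;  // outstanding allocations, tracked for demonstration
  void* Allocate(std::size_t bytes) override {
    ++live;
    return std::malloc(bytes);
  }
  void Deallocate(void* ptr) override {
    --live;
    std::free(ptr);
  }
};

// RAII scratch buffer: memory is returned when the object leaves scope.
class ScratchBuffer {
 public:
  ScratchBuffer(ToyAllocator* alloc, std::size_t bytes)
      : alloc_(alloc), data_(alloc->Allocate(bytes)) {
    assert(data_ != nullptr);
  }
  ~ScratchBuffer() { alloc_->Deallocate(data_); }
  ScratchBuffer(const ScratchBuffer&) = delete;
  ScratchBuffer& operator=(const ScratchBuffer&) = delete;
  void* data() const { return data_; }

 private:
  ToyAllocator* alloc_;
  void* data_;
};

// Hypothetical context: the only place that knows which allocator applies.
struct ToyContext {
  ToyAllocator* allocator;
};

// "Kernel" body: sums 1..n using scratch memory obtained via the context.
int SumFirstN(ToyContext& ctx, int n) {
  assert(n > 0);
  ScratchBuffer buf(ctx.allocator, static_cast<std::size_t>(n) * sizeof(int));
  int* v = static_cast<int*>(buf.data());
  for (int i = 0; i < n; ++i) v[i] = i + 1;
  int total = 0;
  for (int i = 0; i < n; ++i) total += v[i];
  return total;
}
```

Swapping the allocator stored in the context changes where the memory comes from without touching `SumFirstN`, and the scratch memory is released automatically when the function returns.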

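【Comments】:

Thanks for your answer. I may be stretching this a bit, but could you comment briefly on static (compile-time) memory allocation? For example, would declaring statically shaped arrays inside the CUDA kernel cause problems in a multi-GPU scenario?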