How to run multiple Keras programs on a single GPU? Answers

【Question title】: How to run multiple keras programs on single gpu?
【Posted】: 2018-05-22 18:34:09
【Question】:

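I am working on a Python project in which I need to build multiple Keras models for each dataset. When I run the program that builds a Keras model, it uses only about 10% of my GPU (GTX 1050 Ti).

My question is: can I use 100% of my GPU to reduce the run time? Or is it possible to run multiple programs on the same GPU?

I tried running multiple programs on a single GPU, but they do not run in parallel. For example, when I run a single Python program each epoch takes 5 seconds, whereas when I run 2 programs at once each epoch takes 10 seconds. What is the best way to run multiple programs?

Thanks in advance!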

【Question comments】:

Tags: python tensorflow keras

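【Solution 1】:

I am not sure there is a proper way to do this, but this "gambiarra" (a hack) seems to work quite well.

Build one model that joins two or more models in parallel. The only drawback: you need the same number of input samples for every model when training and predicting in parallel.

How to use two models in parallel with a functional API model: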

input1 = Input(inputShapeOfModel1)
input2 = Input(inputShapeOfModel2)

output1 = model1(input1)
output2 = model2(input2) #it could be model1 again, using model1 twice in parallel. 

parallelModel = Model([input1,input2], [output1,output2])

You then train and predict with this model, passing the parallel input and output data together:

parallelModel.fit([x_train1, x_train2], [y_train1, y_train2], ...)

Working test code:

from keras.layers import *
from keras.models import Model, Sequential
import numpy as np

#simulating two "existing" models
model1 = Sequential()
model2 = Sequential()

#creating "existing" model 1
model1.add(Conv2D(10,3,activation='tanh', input_shape=(20,20,3)))
model1.add(Flatten())
model1.add(Dense(1,activation='sigmoid'))

#creating "existing" model 2
model2.add(Dense(20, input_shape=(2,)))
model2.add(Dense(3))


#part containing the proposed answer: joining the two models in parallel
inp1 = Input((20,20,3))
inp2 = Input((2,))

out1 = model1(inp1)
out2 = model2(inp2)

model = Model([inp1,inp2],[out1,out2])


#treat the new model as any other model
model.compile(optimizer='adam', loss='mse')

#dummy input data x and y, for models 1 and 2
x1 = np.ones((30,20,20,3))
y1 = np.ones((30,1))
x2 = np.ones((30,2))
y2 = np.ones((30,3))

#training the model and predicting
model.fit([x1,x2],[y1,y2], epochs = 50)
ypred1,ypred2 = model.predict([x1,x2])

print(ypred1.shape)
print(ypred2.shape)
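
One detail in the test code: `loss='mse'` is applied to both outputs, even though model 1 ends in a sigmoid. The following is only a sketch of how each output could get its own loss by passing a list to `compile`. It assumes a recent Keras version in which `Sequential` accepts an `Input` object as its first entry. The choice of `binary_crossentropy` and the equal `loss_weights` are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from keras.layers import Input, Conv2D, Flatten, Dense
from keras.models import Model, Sequential

# Two independent "existing" models, same shapes as in the test code
model1 = Sequential([
    Input(shape=(20, 20, 3)),
    Conv2D(10, 3, activation='tanh'),
    Flatten(),
    Dense(1, activation='sigmoid'),
])
model2 = Sequential([
    Input(shape=(2,)),
    Dense(20),
    Dense(3),
])

# Join them in parallel
inp1 = Input(shape=(20, 20, 3))
inp2 = Input(shape=(2,))
model = Model([inp1, inp2], [model1(inp1), model2(inp2)])

# One loss per output, in the same order as the outputs list.
# Assumption: binary_crossentropy for the sigmoid output, mse for the other.
model.compile(optimizer='adam',
              loss=['binary_crossentropy', 'mse'],
              loss_weights=[1.0, 1.0])

# Dummy data for both models (same number of samples for each)
x1 = np.ones((30, 20, 20, 3))
y1 = np.ones((30, 1))
x2 = np.ones((30, 2))
y2 = np.ones((30, 3))

history = model.fit([x1, x2], [y1, y2], epochs=2, verbose=0)
ypred1, ypred2 = model.predict([x1, x2], verbose=0)
```

The total loss the optimizer minimizes is the weighted sum of the per-output losses, so `loss_weights` is the place to rebalance if one model's loss dominates.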

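Advanced solution - grouping data to improve speed and match sample counts

There is still room for further optimization, because this approach synchronizes batches between the two models. So if one model is much faster than the other, the fast model is held to the speed of the slow one.

Also, if the models have different numbers of batches, you will need to train/predict some of the remaining data separately.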
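
As a rough illustration of handling leftover data, the sketch below cuts the larger dataset into a part that matches the smaller one and a remainder. The sample counts (300 and 340) and the variable names are made up for the example.

```python
import numpy as np

# Hypothetical sizes: model 1 has 300 samples, model 2 has 340
x1 = np.ones((300, 4))
y1 = np.ones((300, 1))
x2 = np.ones((340, 2))
y2 = np.ones((340, 3))

# The joined model can only consume as many samples as the smaller set has
n_shared = min(len(x1), len(x2))

x2_shared, y2_shared = x2[:n_shared], y2[:n_shared]
x2_rest, y2_rest = x2[n_shared:], y2[n_shared:]

# The shared part would go to the joined model, e.g.
#   parallelModel.fit([x1, x2_shared], [y1, y2_shared], ...)
# and the remainder to model2 alone (after compiling model2 itself), e.g.
#   model2.fit(x2_rest, y2_rest, ...)
```

Because the joined model wraps the same `model2` object, training `model2` alone on the remainder updates the same weights.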

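You can work around these limitations too if you group your input data and use some custom reshapes in Lambda layers in the model: reshape the batch dimension at the beginning and restore it at the end.

For example, if x1 has 300 samples and x2 has 600 samples, you can reshape the inputs and outputs: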

x2 = x2.reshape((300,2,....))
y2 = y2.reshape((300,2,....))

Before and after model2, you use:

#K is the Keras backend (import keras.backend as K)

#before
Lambda(lambda x: K.reshape(x,(-1,....))) #transforms into the inner model's input shape

#after
Lambda(lambda x: K.reshape(x, (-1,2,....))) #transforms into the grouped shape for the output

where .... is the original input and output shape (not counting the batch size).
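
To check that the grouping is lossless, here is a NumPy-only sketch of the same reshapes. A plain matrix multiplication stands in for model2 (an assumption made so the example runs without a GPU or Keras); the shapes follow the 300/600 example, with 2 input features and 3 output features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for model2: maps (n, 2) -> (n, 3)
weights = rng.normal(size=(2, 3))

def inner_model(batch):
    return batch @ weights

x2 = rng.normal(size=(600, 2))  # 600 samples with 2 features each

# Group samples in pairs so the outer batch dimension becomes 300
x2_grouped = x2.reshape((300, 2, 2))

# "before" step: merge the group axis back into the batch axis
flat = x2_grouped.reshape((-1, 2))            # shape (600, 2)
out_flat = inner_model(flat)                  # shape (600, 3)

# "after" step: restore the grouped layout that the grouped targets use
out_grouped = out_flat.reshape((-1, 2, 3))    # shape (300, 2, 3)

# Ungrouping the result matches running the model on the raw data
same = np.allclose(out_grouped.reshape((600, 3)), inner_model(x2))
```

The two Lambda layers do these same "before" and "after" reshapes on tensors inside the model, so the outer model sees 300 grouped samples while model2 still processes 600 individual ones.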

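Then you need to think about which is best: grouping the data to sync the data sizes, or grouping the data to sync the speeds.

(Advantage over the next solution: you can easily group by any number, such as 2, 5, 10, 200.....)

Advanced solution - using the same model more than once in parallel to double the speed

You can also use the same model twice in parallel, as in this code. This will probably double its speed.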

from keras.layers import *
from keras.models import Model, Sequential
#import keras.backend as K
import numpy as np
#import tensorflow as tf


#simulating two "existing" models
model1 = Sequential()
model2 = Sequential()

#model 1
model1.add(Conv2D(10,3,activation='tanh', input_shape=(20,20,3)))
model1.add(Flatten())
model1.add(Dense(1,activation='sigmoid'))

#model 2
model2.add(Dense(20, input_shape=(2,)))
model2.add(Dense(3))

#joining the models
inp1 = Input((20,20,3))

#two inputs for model 2 (the model we want to run twice as fast)
inp2 = Input((2,))
inp3 = Input((2,))

out1 = model1(inp1)
out2 = model2(inp2) #use model 2 once
out3 = model2(inp3) #use model 2 twice

model = Model([inp1,inp2,inp3],[out1,out2,out3])

model.compile(optimizer='adam', loss='mse')

#dummy data - remember to have two inputs for model 2, not repeated
x1 = np.ones((30,20,20,3))
y1 = np.ones((30,1))
x2 = np.ones((30,2)) #first input for model 2
y2 = np.ones((30,3)) #first output for model 2
x3 = np.zeros((30,2)) #second input for model 2
y3 = np.zeros((30,3)) #second output for model 2

model.fit([x1,x2,x3],[y1,y2,y3], epochs = 50)
ypred1,ypred2,ypred3 = model.predict([x1,x2,x3])

print(ypred1.shape)
print(ypred2.shape)
print(ypred3.shape)

Advantage over the previous solution: less trouble manipulating data and writing custom reshapes.
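
To feed a model that is used twice from a single dataset, the data has to be split into two equal, non-overlapping parts. The sketch below shows one way to do that; `split_in_two` is a hypothetical helper, not something Keras provides, and the 31-sample dataset is made up to show what happens with an odd count.

```python
import numpy as np

def split_in_two(x, y):
    """Split (x, y) into two equal halves plus any odd leftover sample."""
    half = len(x) // 2
    first = (x[:half], y[:half])
    second = (x[half:2 * half], y[half:2 * half])
    leftover = (x[2 * half:], y[2 * half:])
    return first, second, leftover

# Hypothetical dataset with an odd number of samples
x = np.arange(62, dtype='float32').reshape((31, 2))
y = np.arange(93, dtype='float32').reshape((31, 3))

(x2, y2), (x3, y3), (x_left, y_left) = split_in_two(x, y)

# x2/y2 and x3/y3 can be passed as the two inputs/targets of the shared model.
# After predicting, the outputs can be stitched back in the original order,
# e.g. np.concatenate([ypred2, ypred3]).
restored = np.concatenate([x2, x3, x_left])
```

Any leftover sample has to be trained or predicted separately.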

【Comments】:

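Thanks Daniel. I am not very good with the functional API, but I will try the solution you provided. Just to outline my task: I am using a Sequential model with an LSTM architecture to forecast time series. It is very simple code. The problem is that I need to run the same program for every dataset, both to build the model and to predict the data for each dataset.
You can use the same model several times in parallel, as in the last part of my answer. Then split your data in two parts.
I really recommend that everybody learn the functional API. It is not hard and it opens up a lot of possibilities.
Is there a way to specify that each "inner model" is trained on a specific GPU (assuming at least 2 GPUs are available)?
Use with tf.device in each part of the model definition: tensorflow.org/api_docs/python/tf/device - I think it works better with the functional API than with Sequential.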