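【Posted】:2015-11-17 13:52:07
【Question】:
I am trying to run several TensorFlow sessions concurrently on a CentOS 7 machine with 64 CPUs. A colleague reports that, with the following two files, he gets a parallel speedup on his 4-core machine: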
mnist.py
import numpy as np
import input_data
from PIL import Image
import tensorflow as tf
import time


def main(randint):
    print 'Set new seed:', randint
    np.random.seed(randint)
    tf.set_random_seed(randint)
    mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

    # Setting up the softmax architecture
    x = tf.placeholder("float", [None, 784])
    W = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    y = tf.nn.softmax(tf.matmul(x, W) + b)

    # Setting up the cost function
    y_ = tf.placeholder("float", [None, 10])
    cross_entropy = -tf.reduce_sum(y_*tf.log(y))
    train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

    # Initialization
    init = tf.initialize_all_variables()
    sess = tf.Session(
        config=tf.ConfigProto(
            inter_op_parallelism_threads=1,
            intra_op_parallelism_threads=1
        )
    )
    sess.run(init)

    for i in range(1000):
        batch_xs, batch_ys = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    print sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels})


if __name__ == "__main__":
    t1 = time.time()
    main(0)
    t2 = time.time()
    print "time spent: {0:.2f}".format(t2 - t1)
parallel.py
import multiprocessing
import numpy as np
import mnist
import time
t1 = time.time()
p1 = multiprocessing.Process(target=mnist.main,args=(np.random.randint(10000000),))
p2 = multiprocessing.Process(target=mnist.main,args=(np.random.randint(10000000),))
p3 = multiprocessing.Process(target=mnist.main,args=(np.random.randint(10000000),))
p1.start()
p2.start()
p3.start()
p1.join()
p2.join()
p3.join()
t2 = time.time()
print "time spent: {0:.2f}".format(t2 - t1)
In particular, he says he observed:
Running a single process took: 39.54 seconds
Running three processes took: 54.16 seconds
However, when I run the code, I get:
python mnist.py
==> Time spent: 5.14
python parallel.py
==> Time spent: 37.65
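As you can see, multiprocessing slows things down significantly for me (three processes take roughly seven times as long as one), whereas it does not for my colleague. Does anyone know why this happens and what can be done to fix it?
Edit
Here is some sample output. Note that loading the data appears to happen in parallel, but training the individual models looks very sequential in the output (this can be verified by watching CPU usage in top while the program runs).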
#$ python parallel.py
Set new seed: 9672406
Extracting MNIST_data/train-images-idx3-ubyte.gz
Set new seed: 4790824
Extracting MNIST_data/train-images-idx3-ubyte.gz
Set new seed: 8011659
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 1
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 1
0.9136
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 1
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 1
0.9149
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 1
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 1
0.8931
time spent: 41.36
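Watching top gives only a rough impression of whether the workers really run one after another. A minimal sketch of a more direct check follows, assuming Python 3 (for time.process_time) rather than the Python 2 used above: each worker reports its own wall-clock start and end times and the CPU time it consumed, and the parent checks whether the time windows overlap. The names busy_work, timed_worker, run_workers and intervals_overlap are hypothetical, and busy_work is only a stand-in for mnist.main.

```python
import multiprocessing
import os
import time


def busy_work(n):
    # Hypothetical stand-in for mnist.main: a pure-Python CPU-bound loop.
    total = 0
    for i in range(n):
        total += i
    return total


def timed_worker(n, queue):
    # Record this worker's wall-clock window and the CPU time it used,
    # then send the measurements back to the parent process.
    start = time.time()
    cpu_start = time.process_time()
    busy_work(n)
    cpu_used = time.process_time() - cpu_start
    queue.put((os.getpid(), start, time.time(), cpu_used))


def run_workers(num_workers, n):
    queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=timed_worker, args=(n, queue))
             for _ in range(num_workers)]
    for p in procs:
        p.start()
    # Collect one result per worker before joining.
    results = [queue.get() for _ in procs]
    for p in procs:
        p.join()
    return results


def intervals_overlap(results):
    # True if all workers were alive at some common instant, i.e. the
    # latest start happened before the earliest end.
    latest_start = max(r[1] for r in results)
    earliest_end = min(r[2] for r in results)
    return latest_start < earliest_end


if __name__ == "__main__":
    results = run_workers(3, 2000000)
    for pid, start, end, cpu in results:
        print("pid {0}: wall {1:.2f}s, cpu {2:.2f}s".format(pid, end - start, cpu))
    print("windows overlap:", intervals_overlap(results))
```

If the windows overlap but each worker's wall time is far larger than its CPU time, the processes are probably waiting on something rather than computing, which would be consistent with the sequential look of the output above.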
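Another edit
To confirm that the problem is related to TensorFlow rather than to multiprocessing, I replaced the contents of mnist.py with a large loop, as follows:
def main(randint):
    c = 0
    for i in xrange(100000000):
        c += i
Output: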
#$ python mnist.py
==> time spent: 5.16
#$ python parallel.py
==> time spent: 4.86
So I don't think the problem here lies with multiprocessing itself.
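To put the three sets of timings side by side, one can compute the effective speedup of running three jobs in parallel over running them back to back, that is 3 × (single-process time) / (parallel time). The sketch below uses only the numbers quoted in this question; the function name speedup and the labels are illustrative.

```python
def speedup(num_jobs, single_time, parallel_time):
    # Time to run the jobs one after another, divided by the time the
    # parallel run actually took. Values near num_jobs mean good scaling,
    # values below 1 mean parallel execution was slower than serial.
    return num_jobs * single_time / parallel_time


if __name__ == "__main__":
    # (label, single-process seconds, three-process seconds) from the post.
    measurements = [
        ("colleague, TensorFlow", 39.54, 54.16),
        ("my machine, TensorFlow", 5.14, 37.65),
        ("my machine, plain loop", 5.16, 4.86),
    ]
    for label, single, parallel in measurements:
        print("{0}: {1:.2f}x".format(label, speedup(3, single, parallel)))
```

By this measure the colleague gets about 2.2x, the plain loop on my machine scales essentially perfectly (slightly above 3x, presumably timing noise), and the TensorFlow run on my machine comes out at about 0.4x, i.e. slower than running the three jobs serially.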
【Comments】:
-
Are you using Docker? I had to give it access to all of my CPUs.
-
No, I'm not using Docker.
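Even without Docker, the concern raised in the first comment (a process being restricted to fewer CPUs than the machine has) can be checked from Python. This is a minimal sketch, assuming Python 3.3+ on Linux, where os.sched_getaffinity is available; the helper name allowed_cpu_count is hypothetical.

```python
import multiprocessing
import os


def allowed_cpu_count():
    # On Linux, os.sched_getaffinity(0) returns the set of CPU ids the
    # current process may be scheduled on, which can be smaller than the
    # number of CPUs in the machine (containers, taskset, cgroup cpusets).
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback for platforms without affinity support: total CPU count.
    return multiprocessing.cpu_count()


if __name__ == "__main__":
    print("CPUs in machine:", multiprocessing.cpu_count())
    print("CPUs usable by this process:", allowed_cpu_count())
```

If the second number is much smaller than the first, the worker processes would be competing for the same few cores regardless of what TensorFlow does.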
Tags: python parallel-processing python-multiprocessing tensorflow