【Posted】: 2019-06-27 20:36:30
【Question】:
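I'm currently working with 15k images, but that number may grow at some point. I wrote a function that applies several changes to each image: it converts the image to black and white, crops it, resizes it, and then flattens it. Later I will save the formatted images to a CSV file for use with TensorFlow. I'm using the multiprocessing module to make use of more cores on my CPU. It seems that using multiprocessing takes longer than using a for loop to edit one image at a time. I also wrote a simple version of the same program that squares a series of numbers. Using multiprocessing for that one is actually faster.
Would it be better to split the data into batches? I wrote a generator to give me different batches, but I couldn't get multiprocessing to work with it.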
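For the batching part of the question, here is a minimal sketch of feeding a generator of batches to a pool. The names `batches`, `process_batch`, and `run` are hypothetical and not from the original post, and `process_batch` squares plain integers as a stand-in for the image formatting. It relies on `Pool.imap` accepting any iterable, including a generator.

```python
import multiprocessing as mp
from itertools import islice


def batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Stand-in for per-batch work (hypothetical): square each number."""
    return [x * x for x in batch]


def run(items, batch_size=4, workers=2):
    """Send each batch to a worker as one task and collect results in order."""
    results = []
    with mp.Pool(workers) as pool:
        # imap returns the per-batch results in the order the batches were produced
        for out in pool.imap(process_batch, batches(items, batch_size)):
            results.extend(out)
    return results


if __name__ == "__main__":
    print(run(range(10)))
```

The worker function has to be defined at module level so it can be pickled. If hand-written batches are not required, `pool.map(func, iterable, chunksize=...)` groups items into chunks per task in a similar way.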
Comparing the time to format images using multiprocessing and sequential function calls
# comparing time for image formatting using
# sequential and multiprocessing
# vonderasche
# 2/3/2019
import multiprocessing as mp
import time
import numpy as np
import cv2
import os
import sys


def my_format_images(image):
    ''' converts to BW, crops, resizes and then flattens the image'''
    image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    height, width = image.shape
    # crop the longer side so the image becomes a centered square
    if (height < width):
        x_start = int((width - height) / 2)
        x_end = height + x_start
        image = image[0:height, x_start:x_end]
    elif (width < height):
        y_start = int((height - width) / 2)
        y_end = width + y_start
        image = image[y_start:y_end, 0:width]
    image = cv2.resize(image, (100, 100))
    image = image.flatten()
    return image


def load_images(path):
    '''loads images from a provided path'''
    print('loading images')
    image_list = []
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".jpg"):
                img = cv2.imread(os.path.join(root, file))
                image_list.append(img)
    return image_list


def main():
    path = 'images'
    images = load_images(path)
    print('total images loaded: ' + str(len(images)))

    # multiprocessing function call
    # (this timing includes starting the pool and sending every image to a worker)
    start_mp_timer = time.time()
    pool = mp.Pool(4)
    result = pool.map(my_format_images, images)
    end_mp_timer = time.time() - start_mp_timer
    pool.close()
    pool.join()

    # sequential function call
    sum_of_single_thread = []
    start_timer = time.time()
    for i in images:
        num = my_format_images(i)
        sum_of_single_thread.append(num)
    end_timer = time.time() - start_timer

    print('multiprocessing time: ' + ' {: 05.5f}'.format(end_mp_timer) + ' sequential time: ' +' {: 05.5f}'.format(end_timer))


if __name__ == "__main__":
    main()
A simple version that squares a series of numbers to check whether multiprocessing works.
# multiprocessing - test using numbers
# vonderasche
# 2/3/2019
import multiprocessing as mp
import time
import os


def square(x):
    ''' prints the current process id and returns x**x
    (x to the power of x, not x squared, so each call is expensive)'''
    print(os.getpid())
    return x**x


def main():
    data = [4784, 2454, 34545, 54545,
            34545, 24545, 1454, 454542, 52221, 11242, 88478, 447511]

    # multiprocessing function call
    # (the pool is created before the timer starts, so start-up is not measured)
    pool = mp.Pool(4)
    start_mp_timer = time.time()
    result = pool.map(square, data)
    end_mp_timer = time.time() - start_mp_timer
    pool.close()
    pool.join()

    # sequential function call
    sum_of_single_thread = []
    start_timer = time.time()
    for i in data:
        num = square(i)
        sum_of_single_thread.append(num)
    end_timer = time.time() - start_timer

    print('multiprocessing time: ' + '{:05.5f}'.format(end_mp_timer))
    print('sequential time: ' + '{:05.5f}'.format(end_timer))


if __name__ == "__main__":
    main()
【Comments】:
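- Perhaps consider converting to grayscale at load time, rather than reading the image as RGB, incurring 3x the memory pressure, and then converting to grayscale?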
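- You are also loading all the images sequentially up front. You would be better off building a global/list of just the filenames (without loading them) and doing the loading and resizing in parallel under multiprocessing.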
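The two suggestions above can be combined. The following is only a sketch under stated assumptions, not the asker's code: `list_image_paths`, `load_and_format`, and `format_all` are hypothetical names, and the `chunksize` value is arbitrary. Only file paths are sent to the workers, and each worker reads its image directly as grayscale with `cv2.IMREAD_GRAYSCALE`, so no `cvtColor` call is needed.

```python
import multiprocessing as mp
import os

try:
    import cv2  # only needed once images are actually read
except ImportError:
    cv2 = None


def list_image_paths(root, extension=".jpg"):
    """Collect file paths under root without reading any image data."""
    paths = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(extension):
                paths.append(os.path.join(folder, name))
    return sorted(paths)


def load_and_format(path, size=100):
    """Runs in a worker: read as grayscale, center-crop, resize, flatten."""
    if cv2 is None:
        raise RuntimeError("OpenCV (cv2) is required to load images")
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if image is None:  # imread returns None when a file cannot be read
        return None
    height, width = image.shape
    side = min(height, width)
    y_start = (height - side) // 2
    x_start = (width - side) // 2
    image = image[y_start:y_start + side, x_start:x_start + side]
    return cv2.resize(image, (size, size)).flatten()


def format_all(root, workers=4, chunksize=64):
    """Format every image under root, loading inside the worker processes."""
    paths = list_image_paths(root)
    if not paths:
        return []
    with mp.Pool(workers) as pool:
        # Workers receive short path strings and send back small flattened arrays
        return pool.map(load_and_format, paths, chunksize=chunksize)


if __name__ == "__main__":
    results = format_all("images")
    print("total images formatted:", len(results))
```

With this layout the full-size decoded images never have to be pickled and copied between processes, which may account for part of the slowdown in the original version; timing both variants on the real data is the way to confirm it.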
Tags: python python-3.x image list multiprocessing