Formatting a large list of images is taking longer using multiprocessing

【Question title】: Formatting a large list of images is taking longer using multiprocessing
【Posted】: 2019-06-27 20:36:30
【Question】:

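I'm currently working with 15k images, and that number may grow at some point. I wrote a function that makes several changes to each image: it converts it to black and white, crops it, resizes it and then flattens it. Later I will save the formatted images to a CSV file for use with TensorFlow. I'm using the multiprocessing module to make use of more cores on my CPU. However, using multiprocessing seems to take longer than using a for loop to edit one image at a time. I also wrote a simple version of the same program that squares a series of numbers. Using multiprocessing for that is actually faster.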

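Would it be better to split the data into batches? I wrote a generator to give me different batches, but I couldn't get multiprocessing to work with it.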

Comparing the time taken to format the images using multiprocessing versus sequential function calls:

            # comparing time for image formatting using
            # sequential and multiprocessing
            # vonderasche
            # 2/3/2019

            import multiprocessing as mp
            import time
            import numpy as np
            import cv2
            import os
            import sys

            def my_format_images(image):
                ''' converts to BW, crops, resizes and then flattens the image'''

                image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

                height, width = image.shape

                if (height < width):
                    x_start = int((width - height) / 2)
                    x_end = height + x_start
                    image = image[0:height, x_start:x_end]

                elif (width < height):
                    y_start = int((height - width) / 2)
                    y_end = width + y_start
                    image = image[y_start:y_end, 0:width]

                image = cv2.resize(image, (100, 100))

                image = image.flatten()

                return image

            def load_images(path):
                '''loads images from a provided path'''

                print('loading images')
                image_list = []
                for root, dirs, files in os.walk(path):
                    for file in files:
                        if file.endswith(".jpg"):
                            img = cv2.imread(os.path.join(root, file))
                            image_list.append(img)
                # return only after os.walk has visited every directory
                return image_list

            def main():

                path = 'images'
                images = load_images(path)

                print('total images loaded: ' + str(len(images)))

                # multiprocessing function call (the timer also covers pool start-up)
                start_mp_timer = time.time()
                pool = mp.Pool(4)
                result = pool.map(my_format_images, images)
                end_mp_timer = time.time() - start_mp_timer

                # sequential function call
                sum_of_single_thread = []
                start_timer = time.time()
                for i in images:
                    num = my_format_images(i)
                    sum_of_single_thread.append(num)
                end_timer = time.time() - start_timer

                print('multiprocessing time: ' + ' {: 05.5f}'.format(end_mp_timer) + ' sequential time: ' +' {: 05.5f}'.format(end_timer))

            if __name__ == "__main__":
                main()

A simple version that squares a series of numbers, to check whether multiprocessing works at all:

    # multiprocessing - test using numbers
    # vonderasche
    # 2/3/2019

    import multiprocessing as mp
    import time
    import os

    def square(x):
      ''' prints the current process id and returns x**x (x raised to itself, not x squared)'''
      print(os.getpid())
      return x**x

    def main():

      data = [4784, 2454, 34545, 54545,
              34545, 24545, 1454, 454542, 52221, 11242, 88478, 447511]

      # multiprocessing function call
      pool = mp.Pool(4)
      start_mp_timer = time.time()
      result = pool.map(square, data)
      end_mp_timer = time.time() - start_mp_timer


      #  sequential function call
      sum_of_single_thread = []

      start_timer = time.time()
      for i in data:
          num = square(i)
          sum_of_single_thread.append(num)
      end_timer = time.time() - start_timer

      print('multiprocessing time: ' + '{:05.5f}'.format(end_mp_timer))
      print('sequential time: ' + '{:05.5f}'.format(end_timer))

    if __name__ == "__main__":
      main()

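【Comments】:

Maybe consider converting to greyscale on load, rather than reading RGB (which causes 3x the memory pressure) and then converting to greyscale?
You are also loading all the images sequentially at the start. You would be better off building a list of just the file names (without loading them) and doing the loading and resizing in parallel under multiprocessing.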

Tags: python python-3.x image list multiprocessing

【Solution 1】:

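I think you are running into an issue where multiprocessing copies the parent process's memory when it creates the child processes. See "Python multiprocessing memory usage".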

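To confirm, I'd suggest writing two programs: both do some math in a pool, but one of them loads a bunch of stuff into memory before creating the pool. I'd expect the one that loads a bunch of stuff into memory first to have the slower multiprocessing time, even though the pool never uses that stuff.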
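A minimal sketch of that experiment, folded into one script and using only the standard library. The names `busy_math` and `time_pool`, the task sizes and the 256 MB ballast are illustrative assumptions, not part of the original answer. How much the ballast matters probably depends on the process start method (fork versus spawn) and the operating system, so treat the output as a measurement on your machine rather than a proof.

```python
import multiprocessing as mp
import time


def busy_math(n):
    """CPU-bound stand-in for real work: sum of the squares below n."""
    return sum(i * i for i in range(n))


def time_pool(preload_mb=0, tasks=8, n=200_000, processes=4):
    """Time pool creation plus one map call, optionally after filling parent memory."""
    # Hypothetical ballast: memory held by the parent that the workers never use.
    ballast = b"x" * (preload_mb * 1024 * 1024)
    start = time.perf_counter()
    with mp.Pool(processes) as pool:
        results = pool.map(busy_math, [n] * tasks)
    elapsed = time.perf_counter() - start
    del ballast
    return elapsed, results


if __name__ == "__main__":
    # Same work both times; only the amount of preloaded memory differs.
    for mb in (0, 256):
        elapsed, _ = time_pool(preload_mb=mb)
        print(f"preload {mb:4d} MB: {elapsed:.3f} s")
```

The timer starts before the pool is created, so any cost of starting workers from a large parent process is included in the measurement.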

If that is the case, my solution would be to do the loading inside the worker processes.
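One way this could look, combined with the suggestions from the comments (pass file names only, read as greyscale on load). This is a sketch under assumptions, not the asker's final code: `list_image_paths`, `load_and_format`, `format_all` and the `chunksize` value are hypothetical names and settings chosen for illustration. Because `pool.map` pickles its arguments, sending paths means only short strings travel to the workers instead of full image arrays.

```python
import multiprocessing as mp
import os


def list_image_paths(root, extension=".jpg"):
    """Collect image file paths under root without reading any pixel data."""
    paths = []
    for folder, _dirs, files in os.walk(root):
        for name in sorted(files):
            if name.lower().endswith(extension):
                paths.append(os.path.join(folder, name))
    return paths


def load_and_format(path, size=100):
    """Runs in a worker: read as greyscale, centre-crop to a square, resize, flatten."""
    # Imported lazily: the parent only lists file names and never needs OpenCV.
    import cv2

    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if image is None:  # cv2.imread returns None for unreadable files
        return None
    height, width = image.shape
    side = min(height, width)
    y0 = (height - side) // 2
    x0 = (width - side) // 2
    image = image[y0:y0 + side, x0:x0 + side]
    return cv2.resize(image, (size, size)).flatten()


def format_all(root, processes=4, chunksize=64):
    """Send only path strings to the workers; each worker loads its own images."""
    paths = list_image_paths(root)
    with mp.Pool(processes) as pool:
        return pool.map(load_and_format, paths, chunksize=chunksize)


if __name__ == "__main__":
    formatted = format_all("images")
    print("total images formatted:", len(formatted))
```

Each result sent back to the parent is a flattened 100x100 greyscale array, which is much smaller than the original colour image. Unlike the original `load_images`, the extension check here is case-insensitive.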

【Discussion】: