Multiprocessing an array of strings efficiently: answers

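【Question title】: Multiprocessing an array of strings efficiently
【Posted】: 2019-05-02 08:25:22
【Question description】:

I have an array of strings that I need to process. Since the strings can be processed independently, I am doing this in parallel: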

import multiprocessing
import numpy as np

def func(x):
    ls = ["this", "is"]
    return [i.upper() for i in x.split(' ') if i not in ls]

arr = np.asarray(["this is a test", "this is not a test", "see my good example"])
pool = multiprocessing.Pool(processes=2)
tst = pool.map(func, arr)
pool.close()

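My question is the following: are there any obvious ways to improve my code in terms of reducing memory usage and CPU time? For example:

using numpy arrays inside func?
using Python lists instead of numpy arrays?
...?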

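【Comments】:

Why are you using a numpy array?
@roganjosh I was under the impression that numpy arrays are preferred over Python lists because they are more efficient (so I might be wrong then...?)
Yes :) Numpy arrays are not intrinsically more efficient than lists. In many cases (e.g. appending in a loop) they are slower. They only really pay off when you use numpy methods on them; then they can outperform list operations by orders of magnitude. But not everything can be done with arrays.
If ls is actually fairly large in your real problem, the first thing to try is converting it to a set.
@roganjosh It is 100+ words, so I will definitely do that. arr is also long, more than a million entries.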
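
The suggestions in the comments above can be sketched roughly as follows. This is only an illustration, not the asker's final code: the names STOP_WORDS and process_all and the chunksize value are assumptions made up for the example. It keeps the word list in a set, passes a plain Python list to the pool, and batches the work with the chunksize argument of Pool.map.

```python
import multiprocessing

# Hypothetical name: a frozenset gives constant-time average membership
# tests, which matters once the word list holds 100+ entries.
STOP_WORDS = frozenset(["this", "is"])


def func(x):
    # Same logic as in the question, with the lookup done against the set.
    return [w.upper() for w in x.split(' ') if w not in STOP_WORDS]


def process_all(strings, processes=2, chunksize=1000):
    # A plain list is enough as input. chunksize (an arbitrary value here)
    # sends items to the workers in batches, which reduces inter-process
    # overhead when the input has many short strings.
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(func, strings, chunksize=chunksize)


if __name__ == "__main__":
    arr = ["this is a test", "this is not a test", "see my good example"]
    print(process_all(arr))
```

Whether the pool pays off at all depends on how expensive func is per string compared with the cost of pickling the strings and results between processes, so it is worth timing against a plain single-process loop on real data.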

Tags: python arrays string performance multiprocessing

【Solution 1】:

You can use numpy frompyfunc to vectorize the whole execution. This is much faster than the native Python implementation.

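import numpy as np


def func(x):
    ls = ["this", "is"]
    # Return the result instead of printing it, so frompyfunc can collect it.
    # Split on spaces, as in the question.
    return [i.upper() for i in x.split(' ') if i not in ls]


x = np.array(["this is a test", "this is not a test", "see my good example"])
# frompyfunc(func, 1, 1): one input, one output; the result is an object array
result = np.frompyfunc(func, 1, 1)(x)
print(result)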

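【Comments】:

Would it be faster than parallelizing with Joblib or multiprocessing?
That is what the documentation says: vectorization is faster than a normal implementation. numpy vectorize is a wrapper around the primitive function frompyfunc().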
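
One way to settle the speed question is to measure it. The sketch below is only an illustration (the helper names and the input size are made up for the example) and compares np.frompyfunc against a plain list comprehension. Note that np.frompyfunc still calls the Python function once per element, so a large speedup should not be taken for granted.

```python
import timeit

import numpy as np

STOP_WORDS = frozenset(["this", "is"])  # hypothetical name for the word list


def func(x):
    return [w.upper() for w in x.split(' ') if w not in STOP_WORDS]


def with_frompyfunc(arr):
    # Wraps func as a ufunc with one input and one output; the result is an
    # object array whose elements are the lists returned by func.
    return np.frompyfunc(func, 1, 1)(arr)


def with_list_comprehension(strings):
    # Plain Python loop over a list, for comparison.
    return [func(s) for s in strings]


if __name__ == "__main__":
    base = ["this is a test", "this is not a test", "see my good example"]
    strings = base * 10_000  # arbitrary size for the timing run
    arr = np.array(strings)
    t_ufunc = timeit.timeit(lambda: with_frompyfunc(arr), number=5)
    t_list = timeit.timeit(lambda: with_list_comprehension(strings), number=5)
    print("frompyfunc:", t_ufunc)
    print("list comprehension:", t_list)
```

The timings depend on the machine and on the real data, so no numbers are quoted here. The same harness can be extended with the multiprocessing version to see whether the cost of moving data between processes is recovered.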