How to parallelize a for loop without repeating IDs (answers)

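【Question title】: How to parallelize a for loop without repeating IDs
【Posted】: 2018-03-27 19:34:56
【Question description】:

I'm fairly new to Python. I'm crawling a website and extracting values from it, but since there are about 100k pages to crawl, it takes a very long time. I'd like to know how to speed this up. I've read that multithreading may cause conflicts or may not be suitable here, and that multiprocessing would be the best way to start.

Here is a sample of my code: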

import requests

def main():
    for ID in range(1, 100000):
        requests.get("http://example.com/?id=" + str(ID))
        # do stuff/print HTML elements from the page

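If I do this:

import multiprocessing

if __name__ == '__main__':
    for i in range(50):
        p = multiprocessing.Process(target=main)
        p.start()

It does run the function in parallel, but I only want each process to fetch an ID that has not already been fetched by another process. If I call p.join(), it doesn't seem to be any faster than running without multiprocessing, so I don't know what to do.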

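【Question comments】:

It looks like you probably want a multiprocessing Pool and to map that function over the IDs. Start with the same approach as the first example in the docs.
If you are only making requests, requests-futures, which does asynchronous requests with a thread pool, is probably much easier than trying to use multiprocessing.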
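
The Pool and map suggestion from the first comment could look roughly like the sketch below. It is only an illustration under assumptions: build_url, fetch_page, and crawl are hypothetical names, the URL pattern is copied from the question, and fetch_page is a stand-in that does no network I/O (a real version would call requests.get on the URL and parse the response).

```python
import multiprocessing


def build_url(page_id):
    # Hypothetical URL pattern, taken from the question's example.
    return "http://example.com/?id=" + str(page_id)


def fetch_page(page_id):
    # Stand-in for the real work: a real crawler would call
    # requests.get(build_url(page_id)) here and parse the HTML.
    url = build_url(page_id)
    return page_id, len(url)


def crawl(ids, workers=4):
    # Pool.map divides `ids` among the worker processes, so every ID is
    # passed to fetch_page exactly once. Results come back in input order.
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(fetch_page, ids)
```

A call such as crawl(range(1, 100000)) should sit under an if __name__ == '__main__': guard, because on platforms that spawn worker processes (Windows, macOS) the module is re-imported in every worker. This differs from the code in the question, where each of the 50 processes runs the full loop over all IDs.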

Tags: python python-3.x parallel-processing multiprocessing

【Solution 1】:

Here is an example based on the concurrent.futures module.

import concurrent.futures
import requests

# Retrieve a single page and return the processed result
def load_url(page_id, timeout):
    response = requests.get("http://example.com/?id=" + str(page_id), timeout=timeout)
    return do_stuff(response)  # do_stuff is a placeholder: process the HTML elements of the page


# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, page_id, 60): page_id for page_id in range(1,100000)}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
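
If per-future bookkeeping is not needed, the same thread pool can also be driven with executor.map. The following is a minimal sketch, not part of the original answer: fake_load is a hypothetical stand-in for load_url that returns bytes without touching the network, and crawl is an assumed helper name.

```python
import concurrent.futures


def fake_load(page_id):
    # Hypothetical stand-in for load_url: returns bytes instead of
    # making a network request.
    return ("page %d" % page_id).encode()


def crawl(ids, max_workers=5):
    ids = list(ids)  # materialize once, since the IDs are iterated twice
    sizes = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map hands each ID to exactly one worker thread and
        # yields results in the same order as the input IDs.
        for page_id, data in zip(ids, executor.map(fake_load, ids)):
            sizes[page_id] = len(data)
    return sizes
```

One trade-off to keep in mind: with executor.map, an exception raised by a worker is re-raised when its result is reached during iteration and ends the loop, whereas the as_completed loop above handles failures one future at a time.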

【Comments】:

Same idea in both :) requests-futures is a very thin wrapper around this approach, much like requests is for urllib.