Terminate multiple threads when any thread completes a task: answers

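【Question title】: Terminate multiple threads when any thread completes a task
【Posted】: 2011-06-08 22:53:44
【Question description】:

I am new to both Python and threading. I have written Python code that acts as a web crawler and searches websites for a specific keyword. My question is: how can I use threads to run three different instances of my class at the same time? When one of the instances finds the keyword, all three must shut down and stop crawling the web. Here is some code.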

class Crawler:
    def __init__(self):
        # the actual code for finding the keyword goes here
        pass

def main():
    crawl = Crawler()

if __name__ == "__main__":
    main()

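How can I use threads to have Crawler perform three different crawls at the same time?

【Discussion】:

Tags: python multithreading

【Solution 1】:

There does not seem to be a (simple) way to terminate a thread in Python.

Here is a simple example that runs several HTTP requests in parallel: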

import threading
import urllib.request  # Python 3 replacement for the Python 2 urllib2 module

def crawl():
    # Fetch the page; the body is not used in this demo.
    data = urllib.request.urlopen("http://www.google.com/").read()

    print("Read google.com")

threads = []

for n in range(10):
    thread = threading.Thread(target=crawl)
    thread.start()

    threads.append(thread)

# wait until all ten threads have finished

print("Waiting...")

for thread in threads:
    thread.join()

print("Complete.")

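At the cost of some extra overhead, you can use the more powerful multi-process approach, which lets you terminate processes that otherwise behave much like threads.

I have extended the example to use it. Hope this helps: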

import multiprocessing
import urllib.request  # Python 3 replacement for the Python 2 urllib2 module

def crawl(result_queue):
    data = urllib.request.urlopen("http://news.ycombinator.com/").read()

    print("Requested...")

    if "result found (for example)":  # placeholder condition, always true
        result_queue.put("result!")

    print("Read site.")

if __name__ == "__main__":  # required on platforms that spawn child processes
    processes = []
    result_queue = multiprocessing.Queue()

    for n in range(4):  # start 4 processes crawling for the result
        process = multiprocessing.Process(target=crawl, args=[result_queue])
        process.start()
        processes.append(process)

    print("Waiting for result...")

    result = result_queue.get()  # blocks until any of the processes has `.put()` a result

    for process in processes:  # then kill them all off
        process.terminate()

    print("Got result:", result)

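【Discussion】:

Thanks for your answer. What exactly does the join statement do? And how would I implement the multi-process approach?
join basically says: wait here until the thread (its run method) has stopped processing.
.join() waits until the thread has finished executing, so it cannot be used to stop the crawler, only to synchronize your code once crawling is done. I have added a multi-process example to my post (I did not remember the API off the top of my head :P).
Your updated answer with multiprocessing seems to work well, except that the processes are not being terminated. The program hangs at result = result_queue.get(). Any idea what I am doing wrong?
Never mind, your answer works and is exactly what I was looking for. Thanks!

【Solution 2】:

Starting a thread is easy: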

thread = threading.Thread(target=function_to_call_inside_thread)  # pass the callable as target=
thread.start()

Create an event object to signal when you are done:

event = threading.Event()
event.wait() # call this in the main thread to wait for the event
event.set() # call this in a thread when you are ready to stop

Once the event has fired, stop the crawlers. You will need to add a stop() method to your crawler for this:

for crawler in crawlers:
    crawler.stop()

Then call join on the thread:

thread.join() # waits for the thread to finish
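
Putting these pieces together, a minimal sketch might look like the following. It is an illustration under assumptions, not the asker's real crawler: the Crawler class shown here, its pages and keyword parameters, and the crawl_until_found helper are all hypothetical, and time.sleep stands in for a real HTTP request. Note that stop() only sets a flag that the crawl loop checks between pages, because a Python thread cannot be killed from outside.

```python
import threading
import time


class Crawler:
    """Toy crawler (hypothetical): scans in-memory pages instead of the web."""

    def __init__(self, name, pages, keyword, found_event):
        self.name = name
        self.pages = pages
        self.keyword = keyword
        self.found_event = found_event        # shared: set when any crawler succeeds
        self._stop_requested = threading.Event()
        self.result = None

    def stop(self):
        # Ask the crawl loop to exit at its next check.
        self._stop_requested.set()

    def run(self):
        for page in self.pages:
            if self._stop_requested.is_set():
                return                        # cooperative shutdown
            time.sleep(0.01)                  # stands in for a blocking HTTP request
            if self.keyword in page:
                self.result = page
                self.found_event.set()        # wake up the main thread
                return


def crawl_until_found(page_lists, keyword, timeout=5.0):
    found_event = threading.Event()
    crawlers = [Crawler("crawler-%d" % i, pages, keyword, found_event)
                for i, pages in enumerate(page_lists)]
    threads = [threading.Thread(target=c.run) for c in crawlers]
    for thread in threads:
        thread.start()

    found_event.wait(timeout)                 # returns early once a crawler sets the event
    for crawler in crawlers:
        crawler.stop()
    for thread in threads:
        thread.join()                         # waits for each thread to finish
    return [c.result for c in crawlers if c.result is not None]


if __name__ == "__main__":
    pages = [["foo"] * 200, ["bar", "needle here"], ["baz"] * 200]
    print(crawl_until_found(pages, "needle"))
```

If no crawler ever finds the keyword, found_event.wait(timeout) simply returns after the timeout; the 5 second default is an arbitrary choice for this sketch.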

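If you do a lot of this kind of programming, you will want to look at the eventlet module. It lets you write "threaded" code without many of the drawbacks of threads.

【Discussion】:

【Solution 3】:

First, if you are new to Python, I would not recommend tackling threads yet. Get used to the language, then take on multithreading.

That said, if your goal is parallelism (you said "run at the same time"), you should know that in Python (or at least in the default implementation, CPython) multiple threads do not truly run in parallel, even when multiple processor cores are available. Read up on the GIL (Global Interpreter Lock) for more information.

Finally, if you still want to go ahead, look at the threading module in the Python documentation. I would say Python's documentation is about as good as a reference gets, with plenty of examples and explanations.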

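【Discussion】:

"Multiple threads do not truly run in parallel, even when multiple processor cores are available." That is oversimplified and unhelpful in this case. Many blocking operations, such as HTTP requests, release the GIL and will run in parallel. Plain threads are probably sufficient here.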

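To illustrate the point about blocking operations, here is a small timing sketch. It makes one assumption worth stating loudly: time.sleep is used as a stand-in for a blocking HTTP request, since both release the GIL while waiting. The helper names blocking_call and timed_run are made up for this example.

```python
import threading
import time


def blocking_call(seconds):
    # time.sleep releases the GIL while it waits, much like a socket read does.
    time.sleep(seconds)


def timed_run(worker_count, seconds):
    """Run `worker_count` blocking calls in threads and return the wall time."""
    threads = [threading.Thread(target=blocking_call, args=(seconds,))
               for _ in range(worker_count)]
    start = time.monotonic()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.monotonic() - start


if __name__ == "__main__":
    elapsed = timed_run(worker_count=3, seconds=0.5)
    # The three waits overlap, so the total should be far closer to 0.5 s than to 1.5 s.
    print("3 threads took %.2f s" % elapsed)
```

The same experiment with a CPU-bound loop in place of the sleep would be expected to show little or no speedup under CPython, which is the case the GIL warning is really about.
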
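【Solution 4】:

For this problem you can use either the threading module (which, as others have said, will not do true threading because of the GIL) or the multiprocessing module (depending on which version of Python you are using). They have very similar APIs, but I recommend multiprocessing because it is more Pythonic, and I find communicating between processes with Pipes very easy.

You will want a main loop that creates your processes. Each of those processes should run your crawler and have a pipe back to the main thread. Each process should listen for messages on its pipe, do some crawling, and send a message back over the pipe when it finds something (before terminating). Your main loop should iterate over each of the pipes returned to it, listening for this "found something" message. Once it hears that message, it should resend it over the pipes to the remaining processes, then wait for them to finish.

More information can be found here: http://docs.python.org/library/multiprocessing.html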
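
The design described above could be sketched roughly as follows. This is only one possible reading of it, under assumptions: crawl_worker and find_first are hypothetical names, the "found" and "stop" message strings are invented for the example, pages are in-memory strings, and time.sleep stands in for fetching a page.

```python
import multiprocessing
import time
from multiprocessing.connection import wait


def crawl_worker(conn, pages, keyword):
    """Crawl `pages` one at a time, checking the pipe for a stop message."""
    for page in pages:
        if conn.poll():                        # a message from the main loop is waiting
            if conn.recv() == "stop":
                break
        time.sleep(0.01)                       # stands in for fetching the page
        if keyword in page:
            conn.send(("found", page))         # report back before terminating
            break
    conn.close()


def find_first(page_lists, keyword, timeout=5.0):
    workers = []
    for pages in page_lists:
        parent_conn, child_conn = multiprocessing.Pipe()
        process = multiprocessing.Process(
            target=crawl_worker, args=(child_conn, pages, keyword))
        process.start()
        child_conn.close()                     # the parent keeps only its own end
        workers.append((process, parent_conn))

    result = None
    pending = [conn for _, conn in workers]
    deadline = time.monotonic() + timeout
    while pending and result is None and time.monotonic() < deadline:
        for conn in wait(pending, timeout=0.1):
            try:
                kind, page = conn.recv()
            except EOFError:                   # worker finished without a hit
                pending.remove(conn)
                continue
            if kind == "found":
                result = page
                break

    for process, conn in workers:
        if process.is_alive():
            try:
                conn.send("stop")              # pass the news on to the remaining workers
            except OSError:
                pass                           # worker exited in the meantime
    for process, conn in workers:
        process.join()
        conn.close()
    return result


if __name__ == "__main__":
    page_lists = [["foo"] * 200, ["bar", "needle here"], ["baz"] * 200]
    print(find_first(page_lists, "needle"))
```

The workers stop cooperatively here rather than being terminated, which matches the "resend the message, then wait for them to finish" step. The `if __name__ == "__main__":` guard matters on platforms where child processes re-import the main module.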

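【Discussion】:

It only really makes sense to use the multiprocessing module if you are CPU-bound.
Is there a reason not to use it if you are not? Note: I am not saying you have to use it. You could implement roughly the same solution with the threading module.
Agreed, there is no real reason to use multiprocessing here, just extra headaches.
The extra overhead of starting additional processes, added complexity in communication between the host and client programs, reduced compatibility with other Python implementations, and slightly different behavior on Linux/Windows.
Honest question: how is it more of a headache than using threads? Having used both, I find them about equally painful.

【Solution 5】:

First of all, threads are not the solution in Python. Because of the GIL, threads do not work in parallel. So you can handle this with multiprocessing, and you will be limited by the number of processor cores.

What is the goal of your work? Do you want a crawler? Or do you have some academic goal (learning threading and Python, etc.)?

One more point: a crawler consumes more resources than other programs, so what is the selling point of your crawler?

【Discussion】: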