Synchronizing multiple threads in Python

【Question title】: Synchronise multi-threads in Python
【Posted】: 2014-10-07 10:50:22
【Question】:

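The BrokenLinkTest class in the code below does the following:

Fetches the web page at the given URL
Finds all the links in the page
Fetches the headers of the links concurrently (this is done to check whether a link is broken)
Prints "completed" once all the headers have been received.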

import threading

from bs4 import BeautifulSoup
import requests

class BrokenLinkTest(object):

    def __init__(self, url):
        self.url = url
        self.thread_count = 0
        self.lock = threading.Lock()

    def execute(self):
        soup = BeautifulSoup(requests.get(self.url).text)
        self.lock.acquire()
        for link in soup.find_all('a'):
            url = link.get('href')
            threading.Thread(target=self._check_url(url))
        self.lock.acquire()

    def _on_complete(self):
        self.thread_count -= 1
        if self.thread_count == 0: #check if all the threads are completed
            self.lock.release()
            print "completed"

    def _check_url(self, url):
        self.thread_count += 1
        print url
        result = requests.head(url)
        print result
        self._on_complete()


BrokenLinkTest("http://www.example.com").execute()

Can the concurrency/synchronization part be done in a better way? I did it using threading.Lock. This is my first attempt at using Python threads.

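【Comments】:

Take a look at pool.map in docs.python.org/2/library/multiprocessing.html. It will make your code much simpler.
It's unclear what you want, what you have, and how you expect to get there with what you are doing. Please give example input and the desired output, and explain what you have been trying to do to achieve it.
print is not thread-safe, so the output will get garbled: all of those threads call print at arbitrary times.
See these code examples, which show how to make multiple concurrent connections with or without multiple threads and how to limit (synchronize) them: Limiting number of processes in multiprocessing python, Problem with multi threaded Python app and socket connections, Brute force basic http authorization using httplib and multiprocessing, Is there a way to run cpython on a diffident thread without risking a crash?.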
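
As a rough illustration of the pool.map suggestion in the first comment, here is a minimal sketch. It uses multiprocessing.pool.ThreadPool (a thread-backed pool with the same map interface) rather than a process pool, which is an assumption on my part. check_url and check_all are hypothetical names, and check_url is a stand-in that does no network access; in the question's code it would wrap requests.head(url).

```python
from multiprocessing.pool import ThreadPool


def check_url(url):
    # Hypothetical stand-in for requests.head(url): no network access here.
    # It returns a (url, status) pair, with None for non-HTTP links.
    if url.startswith("http"):
        status = 200
    else:
        status = None
    return url, status


def check_all(urls, workers=4):
    # map() blocks until every URL has been processed and returns the
    # results in the same order as the input, so no manual counter or
    # lock is needed to detect completion.
    pool = ThreadPool(workers)
    try:
        return pool.map(check_url, urls)
    finally:
        pool.close()
        pool.join()


results = check_all(["http://www.example.com", "mailto:someone@example.com"])
```

Because map only returns once all work is finished, the "completed" message could simply be printed after the call to check_all.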

Tags: python multithreading synchronization

【Solution 1】:

def execute(self):
    soup = BeautifulSoup(requests.get(self.url).text)
    threads = []
    for link in soup.find_all('a'):
        url = link.get('href')
        t = threading.Thread(target=self._check_url, args=(url,))
        t.start()
        threads.append(t)
    for thread in threads:
        thread.join()

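You can use the join method to wait for all the threads to finish.

Note that I also added a start call and passed the bound method object as the target argument. In your original example, you were calling _check_url in the main thread and passing its return value as the target argument.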
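
The difference between the two ways of passing target can be seen in a small sketch. The worker function, the results dictionary, and the thread name "checker" are made up for illustration and are not part of the original code.

```python
import threading


def worker(results, key):
    # Record the name of the thread that actually executed this call.
    results[key] = threading.current_thread().name


results = {}
caller = threading.current_thread().name

# Mistake from the question: worker(...) is called right here, in the
# calling thread, and its return value (None) is what gets passed as target.
t_wrong = threading.Thread(target=worker(results, "wrong"))

# Fix from the answer: pass the callable and its arguments separately,
# start the thread, then join it to wait for completion.
t_right = threading.Thread(target=worker, args=(results, "right"), name="checker")
t_right.start()
t_right.join()
```

After this runs, results["wrong"] holds the name of the calling thread, while results["right"] holds "checker", the thread that was actually started.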

【Comments】:

【Solution 2】:

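All threads in Python run on the same core, so you won't gain any performance by doing it this way. Also, it is unclear what is actually happening:

You never actually start a thread, you only initialize it
The threads themselves do nothing other than decrement the thread count

You may only gain performance in a thread-based scenario if your program hands work off to I/O (sending requests, writing to files and so on), where other threads can work in the meantime.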

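【Comments】:

Python threads are real OS threads, and they can run on multiple CPUs. Pure Python code in CPython is protected by the Global Interpreter Lock (GIL), so only one Python thread is active at a time. However, the GIL may be released during I/O (and other blocking system calls), ctypes releases the GIL by default, and many C extension modules such as numpy, lxml and regex can release the GIL during computation. The relevant part here: requests.get() is likely I/O-bound, and BeautifulSoup may use lxml if it is installed.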
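
The point about the GIL being released during blocking calls can be checked with a small timing sketch. Here time.sleep stands in for a blocking network request (an assumption made so the example needs no network), and fake_request and run_concurrently are hypothetical names.

```python
import threading
import time


def fake_request(delay):
    # Stand-in for a blocking call such as requests.head(url).
    # time.sleep releases the GIL while waiting, as blocking I/O does.
    time.sleep(delay)


def run_concurrently(n_threads, delay):
    # Start n_threads workers that each wait for `delay` seconds and
    # return the total wall-clock time until all of them have finished.
    threads = [threading.Thread(target=fake_request, args=(delay,))
               for _ in range(n_threads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start


elapsed = run_concurrently(4, 0.3)
print("4 waits of 0.3s took %.2fs" % elapsed)
```

If the waits overlap, the total should be close to a single delay rather than four times the delay, which is why threads can help with I/O-bound work such as checking many links.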