【Posted at】: 2020-06-22 16:02:24
【Question description】:
I'm working on a project that parses data from a lot of websites. Most of my code is already written, so I was looking forward to using asyncio to eliminate the I/O waits, but I still wanted to test how threading would do, for better or worse. To that end, I wrote some simple code that makes requests to 100 websites. By the way, I'm using the requests_html library for this, and fortunately it also supports asynchronous requests.
The asyncio code looks like this:
import requests
import time
from requests_html import AsyncHTMLSession

aio_session = AsyncHTMLSession()

urls = [...]  # 100 urls

async def fetch(url):
    try:
        response = await aio_session.get(url, timeout=5)
        status = 200
    except requests.exceptions.ConnectionError:
        status = 404
    except requests.exceptions.ReadTimeout:
        status = 408
    if status == 200:
        return {
            'url': url,
            'status': status,
            'html': response.html
        }
    return {
        'url': url,
        'status': status
    }

def extract_html(urls):
    tasks = []
    for url in urls:
        tasks.append(lambda url=url: fetch(url))
    websites = aio_session.run(*tasks)
    return websites

if __name__ == "__main__":
    start_time = time.time()
    websites = extract_html(urls)
    print(time.time() - start_time)
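For reference, my understanding (an assumption on my part, I haven't checked the requests_html source) is that aio_session.run simply schedules all the coroutines on one event loop and waits for them, so all 100 requests are in flight at the same time. A minimal plain-asyncio sketch of the same pattern, using the fetch coroutine defined above (fetch_all is a name I made up, not a requests_html API):

import asyncio

async def fetch_all(urls):
    # Schedule one fetch() coroutine per URL and wait for all of them.
    # Nothing here limits how many requests run concurrently -- all 100 start together.
    return await asyncio.gather(*(fetch(url) for url in urls))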
Execution time (over several runs):
13.466366291046143
14.279950618743896
12.980706453323364
But if I run an example with threading:
from queue import Queue
import requests
from requests_html import HTMLSession
from threading import Thread
import time

num_fetch_threads = 50
enclosure_queue = Queue()
html_session = HTMLSession()

urls = [...]  # 100 urls

def fetch(i, q):
    while True:
        url = q.get()
        try:
            response = html_session.get(url, timeout=5)
            status = 200
        except requests.exceptions.ConnectionError:
            status = 404
        except requests.exceptions.ReadTimeout:
            status = 408
        q.task_done()

if __name__ == "__main__":
    for i in range(num_fetch_threads):
        worker = Thread(target=fetch, args=(i, enclosure_queue,))
        worker.setDaemon(True)
        worker.start()

    start_time = time.time()

    for url in urls:
        enclosure_queue.put(url)

    enclosure_queue.join()

    print(time.time() - start_time)
Execution time (over several runs):
7.476433515548706
6.786043643951416
6.717151403427124
What I don't understand is this: both approaches are meant to deal with I/O waits, so why is threading faster? The more threads I add, the more resources it uses, but it's also considerably faster. Can someone explain why threading is faster than asyncio in my example?
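One variation that might make the comparison more even is capping the asyncio side at the same concurrency as the thread pool. A rough sketch (semaphore_limit, bounded_fetch and extract_html_bounded are names I made up; fetch and aio_session are from the asyncio code above):

import asyncio

semaphore_limit = 50  # same concurrency as num_fetch_threads in the threaded version

async def bounded_fetch(url, semaphore):
    # At most `semaphore_limit` fetches run at any one time; the rest wait here.
    async with semaphore:
        return await fetch(url)

def extract_html_bounded(urls):
    semaphore = asyncio.Semaphore(semaphore_limit)
    # Same lambda trick as extract_html above, so each coroutine is created lazily.
    tasks = [lambda url=url: bounded_fetch(url, semaphore) for url in urls]
    return aio_session.run(*tasks)

If this bounded version behaves the same as the unbounded one, the gap is probably not just about how many requests are in flight at once.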
Thanks in advance.
【Question discussion】:
- The line "websites = extract_html(urls:100])" in the async-io code looks messed up.
- @Roy2012 Fixed, I forgot to close a bracket when pasting the code.
Tags: python multithreading python-asyncio