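【Posted】: 2022-01-05 08:36:20
【Question】:
My code downloads PDFs from a paginated listing. Each page links to 10 PDFs and there are about 900 pages, so roughly 9,000 PDFs in total. I am already using threading, but it only gets through about 1,400 PDFs per hour. Please help me improve my code.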
import requests
from bs4 import BeautifulSoup as bs
from concurrent.futures import ThreadPoolExecutor


def writepdf(k, v):
    path = r"C:\Users\deepak jain\Desktop\spectra"
    with requests.Session() as session:
        with open(f'{path}/{k}.pdf', 'wb') as f:
            with session.get(v, stream=True) as r:
                for data in r.iter_content():
                    f.write(data)


def main():
    with requests.Session() as s:
        current_page = 1
        end_number = 900
        threads = []
        with ThreadPoolExecutor() as executor:
            while current_page <= end_number:
                r = s.get(f'https://bidplus.gem.gov.in/bidlists?bidlists&page_no={current_page}')
                r.raise_for_status()
                soup = bs(r.content, 'lxml')
                for i in soup.select('.bid_no > a'):
                    k = i.text.strip().replace('/', '_')
                    v = f'https://bidplus.gem.gov.in{i["href"]}'
                    threads.append(executor.submit(writepdf, k, v))
                if current_page == 1:
                    num_pages = int(soup.select_one('.pagination li:last-of-type > a')['data-ci-pagination-page'])
                    end_number = min(end_number, num_pages)
                current_page += 1
            for t in threads:
                t.result()


if __name__ == '__main__':
    main()
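A note on `writepdf` above: `iter_content()` is called without a `chunk_size`, and in requests that argument defaults to 1, so the body is written one byte at a time. The function also opens a new `Session` for every file, which gives up connection reuse. The following is only a sketch of one way to address both, not a measured fix. The names `get_session` and `write_pdf`, the `session_factory` parameter, the 64 KiB chunk size, and the 60 second timeout are all assumptions chosen for illustration.

```python
import threading
from pathlib import Path

# One session per worker thread, so connections can be reused without
# sharing a single Session object across threads.
_local = threading.local()


def get_session(session_factory):
    """Return the calling thread's session, creating it on first use."""
    if not hasattr(_local, "session"):
        _local.session = session_factory()
    return _local.session


def write_pdf(name, url, out_dir, session_factory, chunk_size=64 * 1024):
    """Stream one PDF to out_dir/name.pdf and return the bytes written.

    chunk_size and the timeout below are illustrative values (assumptions).
    """
    session = get_session(session_factory)
    target = Path(out_dir) / f"{name}.pdf"
    written = 0
    with session.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(target, "wb") as f:
            # Read in large chunks instead of the 1-byte default.
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                written += len(chunk)
    return written
```

With the original script this could be submitted as `executor.submit(write_pdf, k, v, path, requests.Session)`, where `path` is the output folder already used in `writepdf`.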
【Comments】:
-
I don't see any use of threads in this example.
-
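There are two parts of your code that make requests, but the first one (fetching the listing pages) still runs sequentially, so threading that part as well should give some improvement.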
-
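Also, I haven't done enough async work to tell you exactly how to do it, but I do know that this kind of web request pattern is well suited to it and would probably be more efficient than threads.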
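The point about the sequential listing step could be sketched roughly as follows: fetch the listing pages in one small thread pool and hand each discovered link to a second pool for downloading, so the two stages overlap. This stays with threads rather than async. The function `crawl`, its parameters, and the worker counts are hypothetical; `fetch_links` and `download` are callables you would supply yourself (for example, wrapping the `soup.select('.bid_no > a')` loop and the download function from the question).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl(fetch_links, download, pages, page_workers=4, download_workers=16):
    """Run listing fetches and downloads in two overlapping thread pools.

    fetch_links(page_no) -> iterable of (name, url) pairs   (user supplied)
    download(name, url)  -> any result                      (user supplied)
    Returns a dict mapping each name to download's result.
    The worker counts are illustrative defaults, not tuned values.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=page_workers) as page_pool:
        with ThreadPoolExecutor(max_workers=download_workers) as dl_pool:
            page_futures = [page_pool.submit(fetch_links, p) for p in pages]
            dl_futures = {}
            # As soon as any listing page is parsed, queue its downloads.
            for pf in as_completed(page_futures):
                for name, url in pf.result():
                    dl_futures[dl_pool.submit(download, name, url)] = name
            # Collect results; an exception in a download is re-raised here.
            for df in as_completed(dl_futures):
                results[dl_futures[df]] = df.result()
    return results
```

In the question's script, page 1 would still be fetched first to read the last page number, after which `pages` could be `range(2, end_number + 1)`. How many workers the site tolerates is unknown, so the counts are worth adjusting carefully.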
Tags: python python-3.x multithreading performance asynchronous