【Posted】: 2022-01-24 03:57:54
【Problem description】:
Goal:
I am trying to scrape multiple URLs at the same time. I don't want to make too many simultaneous requests, so I am using this solution to limit them.
Problem:
Requests are being made for all tasks at once, instead of only a limited number at a time.
Simplified code:
import asyncio
import os
import random
import time

async def download_all_product_information():
    # TO LIMIT THE NUMBER OF CONCURRENT REQUESTS
    async def gather_with_concurrency(n, *tasks):
        semaphore = asyncio.Semaphore(n)

        async def sem_task(task):
            async with semaphore:
                return await task

        return await asyncio.gather(*(sem_task(task) for task in tasks))

    # FUNCTION TO ACTUALLY DOWNLOAD INFO
    async def get_product_information(url_to_append):
        url = 'https://www.amazon.com.br' + url_to_append
        print('Product Information - Page ' + str(current_page_number) + ' for category ' + str(
            category_index) + '/' + str(len(all_categories)) + ' in ' + gender)
        source = await get_source_code_or_content(url, should_render_javascript=True)
        time.sleep(random.uniform(2, 5))
        return source

    # LOOP WHERE STUFF GETS DONE
    for current_page_number in range(1, 401):
        for gender in os.listdir(base_folder):
            all_tasks = []
            # check all products in the current page
            all_products_in_current_page = open_list(os.path.join(base_folder, gender, category, current_page))
            for product_specific_url in all_products_in_current_page:
                current_task = asyncio.create_task(get_product_information(product_specific_url))
                all_tasks.append(current_task)
            await gather_with_concurrency(random.randrange(8, 15), *all_tasks)

async def main():
    await download_all_product_information()

# just to make sure there are not any problems caused by two event loops
if asyncio.get_event_loop().is_running():  # only patch if needed (i.e. running in Notebook, Spyder, etc.)
    import nest_asyncio
    nest_asyncio.apply()

# for asynchronous functionality
if __name__ == '__main__':
    asyncio.run(main())
What am I doing wrong? Thanks!
【Discussion】:
- Why call gather_with_concurrency via asyncio.run inside an async function? Just await it. And use asyncio.run(main()) instead of the old-fashioned loop.xxx calls. Most importantly: asyncio.run(main()) works like loop.run_until_complete(main()), but in your asyncio.run(await gather_with_concurrency(.. you are running the thing you already awaited.
- Thank you very much for your reply! I updated the code, but it still doesn't work. Did I update it incorrectly?
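A minimal sketch of the concurrency-limiting pattern under discussion. Note two things about the question's code that this sketch avoids: asyncio.create_task() schedules a coroutine to run immediately, so by the time the semaphore in gather_with_concurrency is acquired the work has already started, and time.sleep() blocks the whole event loop (await asyncio.sleep() does not). The fetch coroutine and the counter dict below are stand-ins for the real request code, used only to observe how many coroutines run at once:

```python
import asyncio

async def gather_with_concurrency(n, *coros):
    # Pass plain (not-yet-started) coroutines, NOT tasks from
    # asyncio.create_task: a task starts running as soon as it is
    # created, before the semaphore is ever acquired.
    semaphore = asyncio.Semaphore(n)

    async def sem_coro(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(sem_coro(c) for c in coros))

async def fetch(i, counter):
    # Stand-in for a real request; tracks peak concurrency.
    counter['running'] += 1
    counter['peak'] = max(counter['peak'], counter['running'])
    await asyncio.sleep(0.01)  # asyncio.sleep, not time.sleep, so the loop keeps running
    counter['running'] -= 1
    return i

async def main():
    counter = {'running': 0, 'peak': 0}
    results = await gather_with_concurrency(3, *(fetch(i, counter) for i in range(10)))
    return results, counter['peak']

results, peak = asyncio.run(main())
```

With a limit of 3, peak never exceeds 3 even though ten coroutines are gathered, and asyncio.gather preserves the input order of the results.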
标签: python web-scraping python-asyncio python-requests-html