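[Posted]: 2019-06-19 13:17:12
[Question]:
I wrote a Python script using pyppeteer to collect the names of different institutions, which are spread across multiple pages of a website. I want the script to move through the pages by clicking the next-page button after it has parsed the names on each page.
What I tried: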
import asyncio
from pyppeteer import launch

url = "https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx"

async def fetch_table(link):
    browser = await launch(headless=False)
    [page] = await browser.pages()
    await page.goto(link)
    while True:
        await page.waitForSelector("h1.faqsno-heading", {'visible':True})
        for item in await page.querySelectorAll("h1.faqsno-heading"):
            name = await item.querySelectorEval("div[id^='arrowex']",'e => e.innerText')
            print(name)

        try:
            elem = await page.querySelector("[title='Next Page']")
            await elem.click()
        except Exception: break

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(fetch_table(url))
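The script above runs fine until it hits the following error, usually somewhere between pages 5 and 10, although the exact page varies from run to run.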
Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\demo.py", line 23, in <module>
    loop.run_until_complete(fetch_table(url))
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\asyncio\base_events.py", line 568, in run_until_complete
    return future.result()
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\demo.py", line 11, in fetch_table
    await page.waitForSelector("h1.faqsno-heading", {'visible':True})
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\frame_manager.py", line 834, in __await__
    raise result
pyppeteer.errors.TimeoutError: Waiting for selector "h1.faqsno-heading" failed: timeout 30000ms exceeds.
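However, when I make a small change and click the button like this instead, the script also does its job for a while and then fails with a different error: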
try:
    await page.click("[title='Next Page']")
except Exception: break
This is the error I get:
Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\demo.py", line 48, in <module>
    loop.run_until_complete(fetch_table(url))
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\asyncio\base_events.py", line 568, in run_until_complete
    return future.result()
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\demo.py", line 37, in fetch_table
    await page.waitForSelector("h1.faqsno-heading", {'visible':True})
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\frame_manager.py", line 832, in __await__
    result = yield from self.promise
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\frame_manager.py", line 859, in rerun
    *self._args,
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
    _rewriteError(e)
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\execution_context.py", line 239, in _rewriteError
    raise error
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
    'userGesture': True,
pyppeteer.errors.NetworkError: Protocol error Runtime.callFunctionOn: Target closed.
How can I keep my script running until all the clicks have been performed?
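[Comments]:
- The second error occurs because your browser was closed, probably as a result of the first error. As for the first error, you can set the timeout of waitForSelector by passing it the option {timeout: 30000}, so that it waits longer for the element. (By the way, I'm not a Python programmer, so you will have to look up the Python equivalent of what I wrote.)
- In this library the Python implementation is almost identical to puppeteer. I tried await page.waitForSelector("h1.faqsno-heading",{'timeout':30000}) but still get the same error, that is, the one whose traceback contains 'userGesture': True,
- 30000 is the default value; you can see that the first error says failed: timeout 30000ms exceeds. You need to change it to a higher number so that it waits longer. You can also put 0 there, so that there is no timeout and it waits for as long as it takes.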
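The timeout suggestion from the comments could be written in pyppeteer roughly as follows. This is only a sketch of what the commenter describes, not a confirmed fix for the errors in the question. The helper names `wait_options` and `wait_for_headings` and the 90000 ms default are made up for illustration; the selector comes from the question, and `page` is assumed to be a pyppeteer `Page` (or any object exposing the same `waitForSelector` coroutine).

```python
import asyncio

HEADING = "h1.faqsno-heading"  # selector taken from the question


def wait_options(timeout_ms):
    """Build the options dict for waitForSelector.

    timeout_ms is in milliseconds; pyppeteer's default is 30000,
    and 0 disables the timeout entirely.
    """
    return {"visible": True, "timeout": timeout_ms}


async def wait_for_headings(page, timeout_ms=90000):
    # Hypothetical helper: waits for the headings with a longer
    # timeout than the default (90000 ms is an arbitrary choice).
    return await page.waitForSelector(HEADING, wait_options(timeout_ms))
```

Inside the `while True:` loop of the question this would replace the existing `waitForSelector` call, for example `await wait_for_headings(page, 0)` to wait without a time limit. Note that with a timeout of 0 the script may hang indefinitely if the heading never appears, for instance after the last page.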
Tags: python python-3.x web-scraping puppeteer pyppeteer