【Question title】: Unable to let my script perform all the clicks on the next page button
【Posted】: 2019-06-19 13:17:12
【Question】:

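I've written a script in Python using pyppeteer to collect the names of different institutions, which are spread across multiple pages of a website. What I want is for the script to parse the names on each page and then click the next page button to move on to the following page.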

website address

What I have tried:

import asyncio
from pyppeteer import launch

url = "https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx"

async def fetch_table(link):
    browser = await launch(headless=False)
    [page] = await browser.pages()
    await page.goto(link)
    while True:
        await page.waitForSelector("h1.faqsno-heading", {'visible':True})
        for item in await page.querySelectorAll("h1.faqsno-heading"):
            name = await item.querySelectorEval("div[id^='arrowex']",'e => e.innerText')
            print(name)

        try:
            elem =  await page.querySelector("[title='Next Page']")
            await elem.click()
        except Exception: break

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(fetch_table(url))

The script above runs fine until it hits the following error somewhere between pages 5 and 10. The exact page varies from run to run.

Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\demo.py", line 23, in <module>
    loop.run_until_complete(fetch_table(url))
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\asyncio\base_events.py", line 568, in run_until_complete
    return future.result()
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\demo.py", line 11, in fetch_table
    await page.waitForSelector("h1.faqsno-heading", {'visible':True})
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\frame_manager.py", line 834, in __await__
    raise result
pyppeteer.errors.TimeoutError: Waiting for selector "h1.faqsno-heading" failed: timeout 30000ms exceeds.

However, when I make a small change and try it like this, the script also does its job for a while before failing with a different error:

try:
    await page.click("[title='Next Page']")
except Exception: break

This is the error I get:

Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\demo.py", line 48, in <module>
    loop.run_until_complete(fetch_table(url))
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\asyncio\base_events.py", line 568, in run_until_complete
    return future.result()
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\demo.py", line 37, in fetch_table
    await page.waitForSelector("h1.faqsno-heading", {'visible':True})
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\frame_manager.py", line 832, in __await__
    result = yield from self.promise
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\frame_manager.py", line 859, in rerun
    *self._args,
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
    _rewriteError(e)
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\execution_context.py", line 239, in _rewriteError
    raise error
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
    'userGesture': True,
pyppeteer.errors.NetworkError: Protocol error Runtime.callFunctionOn: Target closed.

How can I keep my script running until all the clicks have been performed?

【Comments】:

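  • The second error occurs because your browser has been closed, probably as a result of the first error. As for the first error, you can set the timeout of waitForSelector by passing it the option {timeout: 30000}, so that it waits longer for the element (by the way, I'm not a Python programmer; you'll have to look up the Python equivalent of what I wrote).
  • In this library the Python implementation is almost identical to puppeteer. I tried await page.waitForSelector("h1.faqsno-heading",{'timeout':30000}) but I still get the same error, namely the one containing 'userGesture': True,
  • 30000 is the default value; you can see that the first error says failed: timeout 30000ms exceeds. You need to change it to a higher number so that it waits longer. You can also put @987654332@ there, so that it has no timeout and waits for as long as it takes.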
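
The timeout discussed in these comments is passed to pyppeteer as a key of the options dict. The following is only a minimal sketch: it assumes `page` is a pyppeteer Page, and the helper name `wait_visible` and the 120000 ms value are hypothetical choices, not something from the thread. According to the pyppeteer documentation, the default is 30000 ms and a timeout of 0 disables the timeout altogether.

```python
async def wait_visible(page, selector, timeout_ms=120_000):
    """Wait until `selector` is visible on `page` and return its handle.

    `page` is assumed to be a pyppeteer Page (hypothetical helper).
    `timeout_ms` is in milliseconds; pyppeteer's default is 30000,
    and 0 means "wait without a timeout".
    """
    options = {"visible": True, "timeout": timeout_ms}
    return await page.waitForSelector(selector, options)
```

Inside the question's loop this could be called as `await wait_visible(page, "h1.faqsno-heading")`. A longer timeout only helps if the page is slow; it does not help if the selector never appears.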

Tags: python python-3.x web-scraping puppeteer pyppeteer


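【Solution 1】:

Note that the website you are trying to scrape has hundreds of pages! I didn't want to tie up my system with a long-running process, so I tried it with slots = 20 pages instead, and it seems to work. You can change the number of slots to experiment for yourself. I am using Python 3.6 and websockets 6.0, on Windows 8.1. I added a few lines of code to limit the number of pages. Apart from that, I also added await page.waitForSelector("[title='Next Page']", {'visible':True}) in a couple of places.

Here is the code: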

import asyncio
from pyppeteer import launch

url = "https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx"

async def fetch_table(link):
    browser = await launch(headless=False)
    [page] = await browser.pages()
    await page.goto(link)
    slots = 20  # change here for the number of pages you want to scrape
    i = 0
    while True:
        i += 1
        if i > slots:
            # Page limit reached: wait for the pager to be visible, then stop.
            await page.waitForSelector("[title='Next Page']", {'visible': True})
            break
        # Wait until the headings of the current page are rendered.
        await page.waitForSelector("h1.faqsno-heading", {'visible': True})
        for item in await page.querySelectorAll("h1.faqsno-heading"):
            name = await item.querySelectorEval("div[id^='arrowex']", 'e => e.innerText')
            print(name)

        try:
            # Make sure the pager button is visible before clicking it.
            await page.waitForSelector("[title='Next Page']", {'visible': True})
            elem = await page.querySelector("[title='Next Page']")
            await elem.click()
        except Exception:
            break


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(fetch_table(url))
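
A possible refinement, which is not part of this answer: after clicking, wait until the first heading's text actually changes, so that the loop does not read the old page again. This is a sketch under assumptions: `page` is a pyppeteer Page, the first `h1.faqsno-heading` differs from one page to the next on this site, and the names `click_next_and_wait`, `CHANGED_JS` and `timeout_ms` are hypothetical. If the content never changes (for example on the last page), `waitForFunction` raises pyppeteer's TimeoutError.

```python
HEADING = "h1.faqsno-heading"

# JavaScript predicate: true once the first heading's text differs from `old`.
CHANGED_JS = """(selector, old) => {
    const el = document.querySelector(selector);
    return el !== null && el.innerText !== old;
}"""


async def click_next_and_wait(page, timeout_ms=60_000):
    """Click 'Next Page' and wait for new content.

    Returns False when no pager button is found (assumed to be the last page).
    """
    # Remember what the current page shows before clicking.
    old = await page.querySelectorEval(HEADING, "e => e.innerText")
    button = await page.querySelector("[title='Next Page']")
    if button is None:
        return False
    await button.click()
    # Extra positional arguments are passed on to the JavaScript predicate.
    await page.waitForFunction(CHANGED_JS, {"timeout": timeout_ms}, HEADING, old)
    return True
```

In the loop above, the `try` block could then be replaced by `if not await click_next_and_wait(page): break`.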

Output of this answer's script (run as stack5.py) at around page 20:

(testenv) C:\Py\pypuppeteer1>python stack5.py
....
....
SHREE SUBRAHMANYA VANGMAYEE PARISHAD, GOAAAPTS2410M
SHREE SUBRAHMANYA VANGMAYEE PARISHAD, GOAAAPTS2410M
WORD FOR THE WORLD FELLOWSHIPAAAAW6295Q
JANA SEVA TRUSTAACTJ0594Q
VAGDEVI VILAS EDUCATIONAL AND CHARITABLE TRUSTAABTV8264G
NCORE IMPACT FOUNDATIONAAFCN9985K
M V M EDUCATIONAL TRUSTAACTM5633K
SOCIETY FOR BETTERMENT OF EDUCATIONAAHAS9354D
SWASTIKAM CHARITABLE TRUSTAAJTS9298K
M/S SANKALP YUVA PRERIT SANVARDHAN BAHUUDDESHIYA SANSTHAAAITS8452J
TRAILOKYA BOUDHA MAHASANGHA SAHAYYAK GAN NAGPURAAABT2581K
MISSIONAL YATRA INDIA (MY INDIA) CHARITABLE TRUSTAAOTM9109M
VRUNDAVAN SHIKSHAN VA BAHUUDDESHIYA SANSTHAAABAV6403C
SHRI JAGDAMBA GOVIGYAN ANUSANDHAN KENDRAAAQTS8474C
SUSHILABAI DEUSKAR PRATISHTHANAALTS8647L
AMRAVATI DISTRICT OPTHALMIC SOCIETYAAETA8499F
ALUMNI ASSOCIATION OF INDIRA GANDHI GOVERNMENT MEDICAL COLLEGE NAGPURAAGTA1367C
VIDYA NIDHI NAGPURAABTN4351L
LATE RAJSINGH DUNGAPUR MEMORIAL FOUNDATIONAABTL5457B
ARTHIK DRUSTYA MAGASVARGIYA SAMAJ SHIKSHAN SANSTHAAACTA6288L
SPARSHAADAS4064Q
LATE PADMADEVI R. MALOO FOUNDATIONAAATL4181B
VISHWARACHNA GRAMINS VIKAS SANSTHAAAATV5359D
