【Question Title】: Trouble Navigating To the Next Page Using Python Selenium
【Posted】: 2021-07-10 04:07:24
【Question Description】:

I am new to web scraping; I am trying to pull information about water utilities from the site. I can currently step through each region via the dropdown menu and reach the first page of results without problems. What I cannot do is navigate through all of a region's result pages before moving on to the next region. The page navigation bar is a list with no "Next" button, so I am trying to iterate over that list using a range, but when I take the len of the list I do not get the correct count. As it stands, I only ever reach the first page of each region. Even after looking through answers to similar questions, I am still struggling to figure out what I am doing wrong or what I should be considering. Any help would be greatly appreciated.

Thanks!

Here is my current code (I am not scraping yet; I am focusing on navigating the pages):

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException

url = 'https://database.ib-net.org/search_utilities?type=2'
browser = webdriver.Firefox()
browser.get(url)
time.sleep(3)
print("Retriving the site...")

# All regions available
regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada)', 'Middle East and Northern Africa', 'South Asia']


for region in regions:
    # Select all options from drop down menu
    selectOption = Select(browser.find_element_by_id('MainContent_ddRegion'))

    print("Now constructing output for: " + region)

    # Select table and wait for data to populate
    selectOption.select_by_visible_text(region)

    time.sleep(4)

    list_of_table_pages = browser.find_element_by_xpath('//*[@id="MainContent_gvUtilities"]/tbody/tr[52]/td/ul')
    no_pages = len(list_of_table_pages.find_elements_by_xpath("//li"))

    print("No of table pages to be scraped are: %d" % no_pages)

    print("Outputting data into " + region + ".csv...")

    all_table_data = []

    # starts the range count from 1 instead of 0
    for page in range(1, no_pages):
        try:
            # Navigate to the next page once done
            table_page = str(page)
            WebDriverWait(browser, 20).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="MainContent_gvUtilities"]/tbody/tr[52]/td/ul/li[' + table_page + ']/a'))).click()
            print("Navigating to next table page...")

        except (TimeoutException, WebDriverException):
            print("Last page reached, moving to the next region...")
            break

    print("No more pages to scrape under %s. Moving to the next region..." % region)

browser.close()
browser.quit()

【Question Comments】:

    Tags: python selenium web-scraping


    【Solution 1】:

    The following calculates the number of pages from the result count and the known maximum number of results per page.
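
    As a quick worked example of that arithmetic (the result count of 120 is made up for illustration): with a page size of 50, a region reporting 120 results needs ceil(120/50) = 3 pages.

    import math

    results_per_page = 50   # known maximum results per page on the site
    num_results = 120       # hypothetical result count for a region
    num_pages = math.ceil(num_results / results_per_page)  # -> 3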

    It loops by clicking the href that contains the target page number. If that number is not currently visible, the exception that is raised is handled by clicking the leading pagination ellipsis to bring more page numbers into view.
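
    Isolated as a minimal sketch (the go_to_page helper name is mine, for illustration only; the full, tested script is below), that click-with-fallback pattern looks like this:

    from selenium.common.exceptions import NoSuchElementException

    def go_to_page(browser, page):
        # Hypothetical helper; browser is an existing WebDriver instance
        try:
            # Try the pager link whose postback href contains e.g. "Page$7"
            browser.find_element_by_css_selector(f'.pagination > li > [href*="Page\${page}"]').click()
        except NoSuchElementException:
            # Page number not rendered: click the pagination ellipsis,
            # which advances into the next block of page numbers
            browser.find_element_by_css_selector('.pagination > li > a').click()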

    I print the first td of the first tr for each page greater than 1 to show that the page was actually visited. I also removed the hard-coded waits in favour of explicit wait conditions.
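
    For reference, a minimal sketch of that substitution (assuming browser is an existing WebDriver instance):

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC

    # Instead of time.sleep(4): return as soon as the table cells are
    # present, waiting at most 5 seconds before raising TimeoutException.
    WebDriverWait(browser, 5).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#MainContent_gvUtilities tr > td')))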

    I used ChromeDriver.

    This is meant to give you a framework to work with. I tested it, and it works across all region selections and pages.


    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.support.ui import Select, WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import NoSuchElementException
    import math

    results_per_page = 50  # the site displays at most 50 results per page
    url = 'https://database.ib-net.org/search_utilities?type=2'
    browser = webdriver.Chrome()  # Firefox() works as well
    browser.get(url)
    print("Retrieving the site...")

    # All regions available
    regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada)', 'Middle East and Northern Africa', 'South Asia']

    for region in regions:
        # Select the region from the drop down menu
        selectOption = Select(browser.find_element_by_id('MainContent_ddRegion'))

        print("Now constructing output for: " + region)

        # Select the region and wait for the table to populate
        selectOption.select_by_visible_text(region)

        WebDriverWait(browser, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#MainContent_gvUtilities tr > td')))

        # Compute the page count from the reported result count
        num_results = int(browser.find_element_by_id('MainContent_lblqResults').text)
        num_pages = math.ceil(num_results / results_per_page)
        print(f'pages to scrape are: {num_pages}')

        # Page 1 is already displayed, so start from page 2
        for page in range(2, num_pages + 1):
            print(f'visiting page {page}')
            try:
                # Click the pager link whose postback href contains this page number
                browser.find_element_by_css_selector(f'.pagination > li > [href*="Page\${page}"]').click()
                WebDriverWait(browser, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#MainContent_gvUtilities tr > td')))
                # Print the first cell of the first data row to show the page was visited
                print(browser.find_element_by_css_selector('#MainContent_gvUtilities tr:nth-child(2) span').text)
            except NoSuchElementException:
                # The page number is not visible yet: click the pagination
                # ellipsis to bring the next block of page numbers into view
                browser.find_element_by_css_selector('.pagination > li > a').click()
            except Exception as e:
                print(e)
                continue
                
    
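    As a side note on why the len() in the question's code came out wrong (my reading, not part of the original answer): in Selenium, an XPath that starts with // matches from the document root even when called on an element, so it counts every li on the page rather than just the pager items. Prefixing the expression with a dot makes it relative:

    # Hedged aside: likely cause of the incorrect page count in the question
    pager = browser.find_element_by_xpath('//*[@id="MainContent_gvUtilities"]/tbody/tr[52]/td/ul')
    all_lis = pager.find_elements_by_xpath('//li')     # matches every <li> in the document
    pager_lis = pager.find_elements_by_xpath('.//li')  # matches only <li> inside the pager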

    【Comments】:
