【Question title】: Scrape element not visible in page source
【Posted】: 2020-12-27 11:40:17
【Question】:

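I am trying to scrape a website that appears to be rendered by JavaScript (https://harleytherapy.com/therapists?page=1). The element I want to scrape (id="downshift-7-menu") does not appear in the page source; it only shows up after I click "Inspect Element".

I looked for a solution here, and so far this is the code I have come up with (a combination of Selenium and BeautifulSoup):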

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time

url = "https://harleytherapy.com/therapists?page=1"

options = webdriver.ChromeOptions()
options.add_argument('headless')
capa = DesiredCapabilities.CHROME
capa["pageLoadStrategy"] = "none"
driver = webdriver.Chrome(chrome_options=options, desired_capabilities=capa)
driver.set_window_size(1440,900)
driver.get(url)
time.sleep(15)

plain_text = driver.page_source
soup = BeautifulSoup(plain_text, 'html')
therapist_menu_id = "downshift-7-menu"
print(soup.find(id=therapist_menu_id))

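I assumed that having Selenium wait 15 seconds would ensure all elements had loaded, but I still cannot find any element with id downshift-7-menu in the soup. Does anyone know what is wrong with my code?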

【Question discussion】:

    Tags: python selenium web-scraping beautifulsoup


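    【Solution 1】:

    The element with ID downshift-7-menu is only loaded after the THERAPIST dropdown has been opened. You can open it by scrolling the dropdown into view so that it loads, then clicking it. You should also consider replacing the sleep with explicit waits.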
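To illustrate how an explicit wait differs from a fixed `time.sleep(15)`, here is a minimal standard-library sketch of the polling idea. It is not Selenium's implementation: `wait_until` and `element_is_ready` are hypothetical names made up for this example. Selenium's `WebDriverWait.until` follows the same general pattern, returning as soon as the condition holds and raising a timeout error otherwise.

```python
import time


def wait_until(condition, timeout=15.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical stand-in for "is the element there yet?": ready on the third check.
checks = {"count": 0}


def element_is_ready():
    checks["count"] += 1
    return "menu" if checks["count"] >= 3 else None


found = wait_until(element_is_ready, timeout=5.0, poll_interval=0.01)
print(found, checks["count"])
```

Unlike a fixed sleep, this returns as soon as the condition is met and fails loudly if it never is, instead of silently continuing with a page that is not ready.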

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    
    wait = WebDriverWait(driver, 15)
    
    # scroll the dropdown into view to load it
    side_menu = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'inner-a377b5')))
    last_height = driver.execute_script("return arguments[0].scrollHeight", side_menu)
    while True:
        driver.execute_script("arguments[0].scrollTo(0, arguments[0].scrollHeight);", side_menu)
        new_height = driver.execute_script("return arguments[0].scrollHeight", side_menu)
        if new_height == last_height:
            break
        last_height = new_height
    
    # open the menu
    wait.until(EC.visibility_of_element_located((By.ID, 'downshift-7-input'))).click()
    
    # wait for the option to load
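    therapist_menu_id = 'downshift-7-menu'
    wait.until(EC.presence_of_element_located((By.ID, therapist_menu_id)))

    # re-parse the page: a soup built before the click does not contain the menu
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    print(soup.find(id=therapist_menu_id))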
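One thing to keep in mind with this approach: `driver.page_source` returns a string snapshot of the DOM at the moment it is read, so a soup built from an earlier snapshot will not contain elements added later, however long you wait. The sketch below shows this with the standard library only. The two HTML strings are hypothetical stand-ins for `driver.page_source` before and after the dropdown is opened (not the site's real markup), and `IdFinder` and `has_id` are names invented for this example.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether any start tag carries the given id attribute."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.found = True


def has_id(html, target_id):
    """Return True if `html` contains an element with id `target_id`."""
    finder = IdFinder(target_id)
    finder.feed(html)
    finder.close()
    return finder.found


# Hypothetical markup standing in for two successive driver.page_source values.
before_click = '<div><input id="downshift-7-input"></div>'
after_click = (
    '<div><input id="downshift-7-input">'
    '<ul id="downshift-7-menu"><li>Example option</li></ul></div>'
)

print(has_id(before_click, "downshift-7-menu"))  # snapshot taken before the click
print(has_id(after_click, "downshift-7-menu"))   # snapshot taken after the click
```

In the Selenium script, this means reading `driver.page_source` and building the soup again after the wait for the menu has succeeded.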
    

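    【Comments】:

    • Thanks! This works, but only after I wait for Selenium to open the site and manually click the Therapist dropdown. Is there a way to skip the manual step?
    • @Brian I edited my answer to scroll to the dropdown first so that it becomes visible.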