【Posted】: 2021-09-02 15:13:44
【Question】:
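I am trying to scrape this web page (https://www.qconcursos.com/questoes-de-concursos/questoes?discipline_ids%5B%5D=13&discipline_ids%5B%5D=16&discipline_ids%5B%5D=39&discipline_ids%5B%5D=46&discipline_ids%5B%5D=56&discipline_ids%5B%5D=57&examining_board_ids%5B%5D=1&examining_board_ids%5B%5D=2&examining_board_ids%5B%5D=5&page=2&scholarity_ids%5B%5D=1&scholarity_ids%5B%5D=2). I am extracting all the images on the site. However, the <img> tags carry no size (width, height) attributes, so they are extracted with only their original attributes, and the images end up far too large. That is why I want to read each image's rendered size and add width and height attributes to its tag.
Example: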
<img src="https://s3.amazonaws.com/assets.qconcursos-hmg.com/cms/brazil-week/logo.svg">
must become
<img src="https://s3.amazonaws.com/assets.qconcursos-hmg.com/cms/brazil-week/logo.svg" height="32" width="120">
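I am able to get all the images and their correct dimensions. My problem is that I cannot insert the values into the tags.
This is the code I am trying to use: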
driver.execute_script(f'let element = document.querySelector("#image_sec>img"); element.setAttribute("width", "{w}"); element.setAttribute("height", "{h}");')
So I need a CSS selector for each image, so that I can find the element with JavaScript and set the attributes on its tag.
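One possible way around needing a selector at all: Selenium's execute_script accepts extra positional arguments, and a WebElement passed that way shows up in the script as arguments[0]. The sketch below assumes that approach. The helper name stamp_rendered_sizes and the choice to skip zero-sized images are illustrative assumptions, not part of the original code.

```python
# JavaScript run in the page: arguments[0] is the element handed over
# from Python, arguments[1] and arguments[2] are the size strings.
SET_SIZE_JS = """
const el = arguments[0];
el.setAttribute('width', arguments[1]);
el.setAttribute('height', arguments[2]);
"""


def stamp_rendered_sizes(driver, images):
    """Set width/height attributes on each image from its rendered size.

    `driver` is a Selenium WebDriver and `images` a list of WebElements
    (hypothetical helper; the name is not from the original post).
    Returns the number of images that were updated.
    """
    stamped = 0
    for img in images:
        size = img.size  # dict with 'width' and 'height'
        w, h = size["width"], size["height"]
        if w == 0 or h == 0:
            # Assumption: hidden or not-yet-rendered images are skipped.
            continue
        driver.execute_script(
            SET_SIZE_JS, img, str(int(round(w))), str(int(round(h)))
        )
        stamped += 1
    return stamped
```

It could be called as stamp_rendered_sizes(driver, driver.find_elements(By.XPATH, '//img')). Because each element is passed directly, every image receives its own dimensions instead of all writes going to whichever node a fixed selector happens to match.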
Here is code you can use to reproduce the problem:
from selenium import webdriver
from selenium.webdriver import ChromeOptions, Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from bs4 import BeautifulSoup
import requests
import undetected_chromedriver as uc
from scrapy.selector import Selector

options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-logging"])
options.add_argument("start-maximized")
driver = uc.Chrome(options=options)
driver.get("https://www.qconcursos.com/questoes-de-concursos/questoes?discipline_ids%5B%5D=13&discipline_ids%5B%5D=16&discipline_ids%5B%5D=39&discipline_ids%5B%5D=46&discipline_ids%5B%5D=56&discipline_ids%5B%5D=57&examining_board_ids%5B%5D=1&examining_board_ids%5B%5D=2&examining_board_ids%5B%5D=5&page=2&scholarity_ids%5B%5D=1&scholarity_ids%5B%5D=2")

# Note: the loop as posted has no exit condition.
while True:
    # The page source is parsed here, before any attribute is set below.
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    link = driver.current_url
    try:
        images = driver.find_elements(By.XPATH, '//img')
        for img in images:
            size = img.size
            w, h = size['width'], size['height']
            # Every iteration targets the same "#image_sec>img" selector,
            # not the `img` element measured above.
            driver.execute_script(f'let element = document.querySelector("#image_sec>img"); element.setAttribute("width", "{w}"); element.setAttribute("height", "{h}");')
    except NoSuchElementException:
        pass
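Two things in the reproduction above may explain why the attributes never show up in the extracted HTML: the selector does not depend on the image being measured, and the page source is parsed before any attribute is written. A minimal sketch of an alternative follows. It assumes that measuring with getBoundingClientRect inside the page is acceptable, and that driver.page_source reflects the modified DOM (my understanding of Chrome's behavior, worth verifying). The name stamp_all_and_snapshot is hypothetical.

```python
# JavaScript run in the page: stamp every <img> with its rendered size
# and report how many images were updated.
STAMP_ALL_JS = """
let count = 0;
for (const img of document.querySelectorAll('img')) {
    const rect = img.getBoundingClientRect();
    if (rect.width > 0 && rect.height > 0) {
        img.setAttribute('width', Math.round(rect.width));
        img.setAttribute('height', Math.round(rect.height));
        count += 1;
    }
}
return count;
"""


def stamp_all_and_snapshot(driver):
    """Stamp all images in one round trip, then read the page source.

    `driver` is a Selenium WebDriver (hypothetical helper). The HTML is
    read only after the script has run, so a parser fed with it sees
    the new width/height attributes.
    """
    count = driver.execute_script(STAMP_ALL_JS)
    html = driver.page_source  # read after the DOM was modified
    return count, html
```

The returned html can then be handed to BeautifulSoup(html, 'html.parser') in place of the snapshot taken at the top of the loop. Doing the work in a single script also avoids one browser round trip per image.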
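Update: I tried following the solution in "Is there a way to extract the CSS selector with Selenium?", which has 2 answers, without success. The first answer retrieves the tag but does not add any attributes. The second behaves the same way.
【Comments】: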
-
Why can't you use requests or beautifulsoup, as mentioned in the other question? Is there any restriction on using them?
-
I edited my post and added step-by-step code to reproduce the problem, in case you want to take a look. I will try requests and beautifulsoup again.
Tags: python python-3.x selenium selenium-webdriver