Find aria-label in an HTML page using BeautifulSoup in Python: answers

【Question title】: find aria-label in html page using soup python
【Posted】: 2020-01-10 06:13:32
【Question description】:

I have an HTML page that contains the following code:

<span itemprop="title" data-andiallelmwithtext="15" aria-current="page" aria-label="you in page number 452">page 452</span>

I want to find the element by its aria-label, so I tried this:

is_452 = soup.find("span", {"aria-label": "you in page number 452"})
print(is_452)

I expected this result:

is_452 = page 452

But I got this result:

is_452 = None

How can I do this?

【Question comments】:

Tags: python-3.x selenium beautifulsoup find

【Solution 1】:

The attribute value contains a line break, so an exact text match fails. Try the following:

from simplified_scrapy.simplified_doc import SimplifiedDoc

# The aria-label value contains a line break, as on the real page
html = '''<span itemprop="title" data-andiallelmwithtext="15" aria-current="page" aria-label="you in page
number 452">page 452</span>'''
doc = SimplifiedDoc(html)
# Raw string so \s reaches the regex engine unchanged; [\s]* tolerates the newline
is_452 = doc.getElementByReg(r'aria-label="you in page[\s]*number 452"', tag="span")
print(is_452.text)
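
The same regex idea can also be sketched with BeautifulSoup alone, since its attribute filters accept a compiled pattern. This is a minimal sketch that assumes the page looks like the sample markup from the question (with the line break inside aria-label) and that html.parser is the parser in use; the variable name label_pattern is only illustrative.

```python
import re

from bs4 import BeautifulSoup

# Sample markup from the question, with the line break inside aria-label
html = '''<span itemprop="title" data-andiallelmwithtext="15" aria-current="page" aria-label="you in page
number 452">page 452</span>'''

soup = BeautifulSoup(html, "html.parser")

# \s+ matches any run of whitespace, including the newline in the attribute value
label_pattern = re.compile(r"you in page\s+number 452")
is_452 = soup.find("span", attrs={"aria-label": label_pattern})

# find() returns None when nothing matches, so guard before reading .text
print(is_452.text if is_452 is not None else None)
```

If the real page separates the words with something other than whitespace, the pattern would need to be adjusted accordingly.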

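【Comments】:

It doesn't work; I always get an exception.
One version had a problem. If it ran for you before and you updated it later, you may be using the faulty version. I have modified the code above, or you can update the library. Please try again, and let me know if you have any questions.
What should html contain? I do this: 'soup = BeautifulSoup(res.text, "html.parser")' and then 'oc = SimplifiedDoc(soup)'
SimplifiedDoc takes only one argument and does not depend on other libraries: SimplifiedDoc(res.text). Here is an example: github.com/yiyedata/simplified-scrapy-demo/tree/master/…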

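【Solution 2】:

The element you want may be a dynamic element. You can use Selenium to extract the value of the aria-label attribute by inducing WebDriverWait for visibility_of_element_located(), using either of the following Locator Strategies:

Using CSS_SELECTOR: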

print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "section#header a.cart-heading[href='/cart']"))).get_attribute("aria-label"))

Using XPATH:

print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//section[@id='header']//a[@class='cart-heading' and @href='/cart']"))).get_attribute("aria-label"))

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

【Comments】:

I can't use a driver. Can I get this data by making a request to the URL?
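
Whether a plain request is enough depends on whether the server already includes the span in the HTML it returns. The sketch below is one way to check that without a driver. The helpers fetch_html and aria_labels are hypothetical names introduced here, the User-Agent header is an assumption, and the standard library's urllib is used in place of requests only to keep the example dependency-free. Printing the label with repr() also exposes hidden whitespace such as the line break.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page as text (hypothetical helper, no browser involved)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=20) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def aria_labels(html, selector="span[aria-label]"):
    """Return (aria-label, text) pairs for every element matching the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [(el["aria-label"], el.get_text(strip=True)) for el in soup.select(selector)]


# Offline check against the sample markup from the question
sample = '''<span itemprop="title" data-andiallelmwithtext="15" aria-current="page" aria-label="you in page
number 452">page 452</span>'''

for label, text in aria_labels(sample):
    # repr() makes the embedded newline visible as \n
    print(repr(label), text)
```

For a live page, aria_labels(fetch_html(url)) would be the call to try. If it returns an empty list, the span is probably rendered by JavaScript, and a browser driver such as Selenium would still be needed.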

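【Solution 3】:

soup fails because of the line break. I have a simpler solution that doesn't use any separate library, only BeautifulSoup. I know this question is old, but it has 1k views, so clearly a lot of people are searching for this. You can use a triple-quoted string to account for the line break. This: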

is_452 = soup.find("span", {"aria-label": "you in page number 452"})
print(is_452)

becomes:

search_label = """you in page
number 452"""
is_452 = soup.find("span", {"aria-label": search_label})
print(is_452)

【Comments】: