Getting a string in a specific tag with Beautiful Soup

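【Question title】: Getting a string in a specific tag with Beautiful Soup
【Posted】: 2021-10-09 08:06:41
【Question】:

I am trying to get all the titles from the site https://webscraper.io/test-sites. For this I am using Beautiful Soup. Each title (in this example, the e-commerce site) always sits in the following part of the markup: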

<h2 class="site-heading">
    <a href="/test-sites/e-commerce/allinone">
        E-commerce site
    </a>
</h2>

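I can't manage to extract that part. I have tried different things, but, for example, the code that seems most intuitive to me doesn't work: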

import re
from bs4 import BeautifulSoup
import requests

url = 'https://webscraper.io/test-sites'
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html)
string = soup.find_all("h2", string=re.compile("E-commerce"))

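How can I get just the titles, in this case "E-commerce site", as a list?

【Comments】:

What output do you get?

Tags: python regex beautifulsoup

【Solution 1】:

If I understand you correctly, you want a list of all the available titles. You can do it like this: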

titles = [x.get_text(strip=True) for x in soup.find_all("h2", class_="site-heading")]
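
The same idea can be tried offline. The following is only a minimal sketch: the `html` string is a made-up, trimmed-down stand-in for the real page (its second heading and the `other` class are assumptions for illustration), so that no network request is needed. It shows the `class_` keyword filter and how `get_text(strip=True)` drops the whitespace surrounding the link text.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page markup; not the real page content.
html = """
<h2 class="site-heading">
    <a href="/test-sites/e-commerce/allinone">
        E-commerce site
    </a>
</h2>
<h2 class="site-heading">
    <a href="/test-sites/tables">
        Table playground
    </a>
</h2>
<h2 class="other">Not a title</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class attribute.
headings = soup.find_all("h2", class_="site-heading")

# strip=True removes the newlines and indentation around the text.
titles = [h.get_text(strip=True) for h in headings]
print(titles)
```

The `h2` with class `other` is skipped, so only the two site headings end up in `titles`.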

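【Comments】:

I don't know Python very well, but the correct call is probably soup.find_all rather than soup.findall, as someone else wrote.

【Solution 2】:

You are close. There are a few issues.

You are not passing a parser to BeautifulSoup when parsing r_html. I have used html.parser here.
I don't think you need regex (re) for this problem.
The titles sit inside h2 tags with the class name site-heading. You can select on those.

This code selects all the titles and prints them.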

from bs4 import BeautifulSoup
import requests

url = 'https://webscraper.io/test-sites'
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html,"html.parser")
string = soup.find_all("h2", class_='site-heading')

for i in string:
    print(i.text.strip())

E-commerce site
E-commerce site with pagination links
E-commerce site with popup links
E-commerce site with AJAX pagination links
E-commerce site with "Load more" buttons
E-commerce site that loads items while scrolling
Table playground
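
As for why the `string=re.compile(...)` attempt in the question finds nothing: as far as I understand Beautiful Soup's behavior, the `string` argument is compared against a tag's `.string`, and `.string` is `None` whenever a tag has more than one child. Here the `h2` contains whitespace, the `a` tag, and more whitespace. A small offline sketch, assuming the markup quoted in the question:

```python
import re
from bs4 import BeautifulSoup

# Markup copied from the question, used offline for illustration.
html = """
<h2 class="site-heading">
    <a href="/test-sites/e-commerce/allinone">
        E-commerce site
    </a>
</h2>
"""

soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# The <h2> has several children (whitespace, <a>, whitespace),
# so its .string attribute is None.
print(h2.string)

# string= is matched against .string, so the <h2> is filtered out.
no_match = soup.find_all("h2", string=re.compile("E-commerce"))
print(no_match)

# The <a> has a single text child, so the same filter matches it.
links = soup.find_all("a", string=re.compile("E-commerce"))
link_titles = [a.get_text(strip=True) for a in links]
print(link_titles)
```

Filtering on the inner `a` tag, or selecting the `h2` by class and reading its text afterwards, sidesteps the problem.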

【Comments】:

【Solution 3】:

import requests
from bs4 import BeautifulSoup

url = 'https://webscraper.io/test-sites'
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html, features="html.parser")
h2s = soup.find_all("h2")
for h2 in h2s:
    print(h2.text.strip())

This will give you the text of every h2 element.

Let me know if this helps.
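
Because this approach collects every `h2` on the page, a regex can still be useful afterwards to keep only some of the titles. The sketch below is an assumption-laden illustration, not part of the answer above: the `html` string and the "About this page" heading are made up, and the filtering is applied to the extracted text rather than inside `find_all`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with one heading that is not a site title.
html = """
<h2 class="site-heading"><a href="#"> E-commerce site </a></h2>
<h2 class="site-heading"><a href="#"> Table playground </a></h2>
<h2>About this page</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: extract the cleaned text of every <h2>.
all_titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# Step 2: keep only the titles that match the pattern.
pattern = re.compile(r"E-commerce")
ecommerce_titles = [t for t in all_titles if pattern.search(t)]

print(all_titles)
print(ecommerce_titles)
```

Matching on the extracted strings avoids any dependence on how the text is nested inside the tag.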

【Comments】:

Thanks, I went with @Ram's first answer and it really helped.