Finding the string index of a tag in BeautifulSoup: answers

【Question title】: Finding string index of a tag in BeautifulSoup
【Posted】: 2023-04-10 06:37:01
【Question】:

Does BeautifulSoup provide a way to get the string index of a tag, or of its text, within the HTML string it was parsed from?

For example:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'lxml')

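Is there a way to find the string index in html_doc at which soup.p (<p class="title"><b>The Dormouse's story</b></p>) begins? Or where its text (The Dormouse's story) begins?

Edit: the expected index for soup.p is 63, i.e. html_doc.index('''<p class="title"><b>The Dormouse's story</b></p>'''). The expected index for its text is 83. I am not using str.index() because the index it returns may not correspond to the tag in question.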

【Comments】:

Tags: python html string beautifulsoup

【Answer 1】:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="title"><b>The Dormouse's story</b></p>
</body>
</html>
"""
def findall(patt, s):
    '''Yields all the positions of the pattern patt in the string s.'''
    i = s.find(patt)
    while i != -1:
        yield i
        i = s.find(patt, i+1)

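soup = BeautifulSoup(html_doc, 'html.parser')
# Serialize the parsed document and the first matching <p> tag back to strings.
x = str(soup)
y = str(soup.find("p", {'class':'title'}))
# Print an (index, substring) pair for every place the tag's markup occurs.
# Note: the indices refer to str(soup), which may differ from html_doc.
print([(i, x[i:i+len(y)]) for i in findall(y, x)])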

【Comments】:

What if the exact same tag appears more than once? Do BeautifulSoup Tag objects keep a position in the string they were parsed from, the way a lexer does?
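
On the question of stored positions: Beautiful Soup 4.8.1 and later record where each tag was found and expose it as Tag.sourceline (1-based line number) and Tag.sourcepos (0-based column within that line). The sketch below shows one way to turn that pair into an absolute index. It assumes the html.parser builder rather than the 'lxml' parser used in the question; with html.parser both values point at the tag's opening "<". The helpers line_starts and tag_offset are hypothetical names introduced for illustration and are not part of bs4.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="title"><b>The Dormouse's story</b></p>
</body>
</html>
"""


def line_starts(text):
    """Return the absolute offset at which each line of text begins."""
    starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            starts.append(i + 1)
    return starts


def tag_offset(tag, starts):
    """Convert a tag's (sourceline, sourcepos) into an index into the source string."""
    # sourceline is 1-based; sourcepos is a 0-based column within that line.
    return starts[tag.sourceline - 1] + tag.sourcepos


soup = BeautifulSoup(html_doc, "html.parser")
starts = line_starts(html_doc)

# One offset per <p>, even though both tags have identical markup.
p_offsets = [tag_offset(p, starts) for p in soup.find_all("p")]
print(p_offsets)  # [63, 112]

# Approximate start of the first paragraph's text: the character right after
# the '>' that closes its <b> start tag. This is a simplification that would
# break if an attribute value contained '>'.
b_offset = tag_offset(soup.p.b, starts)
text_offset = html_doc.index(">", b_offset) + 1
print(text_offset)  # 83
```

Because each Tag carries its own position, identical tags are told apart without searching the string. The offsets index into html_doc itself, not into str(soup).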

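【Answer 2】:

It looks like you are doing some web scraping. I suggest looking into XPath; search for an XPath library for the language you are coding in.

With an XPath selector, you can find the text element like this: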

("//text()[contains(.,"The Dormouse's story")]")

From there, if you need the paragraph element, just select its parent.
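
In Python, the XPath approach might look roughly like the sketch below. It assumes the third-party lxml package (the answer does not name a library) and reuses html_doc from the question. Note that it locates elements rather than string indices.

```python
from lxml import html

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body>
</html>
"""

tree = html.fromstring(html_doc)

# Select every text node that contains the target string.
text_nodes = tree.xpath('//text()[contains(., "The Dormouse\'s story")]')

# lxml text results remember the element they belong to.
parents = [node.getparent() for node in text_nodes]
print([el.tag for el in parents])  # ['title', 'b']

# Select only the <p> elements that contain the target text somewhere inside.
paragraphs = tree.xpath('//p[.//text()[contains(., "The Dormouse\'s story")]]')
print([p.get("class") for p in paragraphs])  # ['title']
```

The first query also matches the title text, so the second query restricts the match to paragraph elements directly instead of walking up from each text node.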

【Comments】:

【Answer 3】:

You can do it like this.

print(soup.find("p").text)

The output is:

The Dormouse's story

You can change the contents of html_doc to verify the code logic.

Change html_doc like this:

html_doc = """
<html><head><title>The EEEE's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body>
</html>
"""

The code produces the same output as above.

【Comments】: