【Question title】: Split by bs4 tag / Get text between two tags
【Posted】: 2026-02-17 02:15:01
【Question description】:

I am currently trying to read the text between two tags on a web page.

This is my code so far:

soup = BeautifulSoup(r.text, 'lxml')
text = soup.text

tag_one = soup.select_one('div.first-header')
tage_two = soup.select_one('div.second-header')

text = text.split(tag_one)[1]
text = text.split(tage_two)[0]

print(text)

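Basically, I am trying to get the text between the first and second headers by identifying their tags. My plan is to split on the first tag and then on the second tag. Is this even possible? Is there a smarter way to do it?

Example: if you look at https://en.wikipedia.org/wiki/Python_(programming_language)

I want to extract the text under "History" by identifying the tags for "History" and "Features and philosophy" and splitting on those tags.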

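【Question discussion】:

  • Could you edit your question to include test input and expected output? It's not entirely clear to me what you are trying to do.
  • @cody I've tried to do that now

Tags: python python-3.x split beautifulsoup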


【Solution 1】:

CSS selector support improved considerably in BeautifulSoup 4.7+. This task can be done with the CSS Level 4 :has() selector, which BeautifulSoup now supports:

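import requests
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/Python_(programming_language)").text
soup = BeautifulSoup(html, "lxml")
# Select every sibling that follows the "History" h2 and still has the
# "Features and philosophy" h2 somewhere after it.
# Assumes each heading is an <h2> wrapping a <span id="...">; the page markup may differ today.
els = soup.select('h2:has(span#History) ~ *:has(~ h2:has(span#Features_and_philosophy))')
with open('text.txt', 'w', encoding='utf-8') as f:
    for el in els:
        text = el.get_text()
        print(text)
        f.write(text + '\n')  # also save the text to the file that was opened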

Output:

 Guido van Rossum at OSCON 2006.Main article: History of PythonPython was conceived in the late 1980s[31] by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC language (itself inspired by SETL)[32], capable of exception handling and interfacing with the Amoeba operating system.[7] Its implementation began in December 1989.[33] Van Rossum's long influence on Python is reflected in the title given to him by the Python community: Benevolent Dictator For Life (BDFL) –  a post from which he gave himself permanent vacation on July 12, 2018.[34]
Python 2.0 was released on 16 October 2000 with many major new features, including a cycle-detecting garbage collector and support for Unicode.[35]
Python 3.0 was released on 3 December 2008. It was a major revision of the language that is not completely backward-compatible.[36] Many of its major features were backported to Python 2.6.x[37] and 2.7.x version series.  Releases of Python 3 include the 2to3 utility, which automates (at least partially) the translation of Python 2 code to Python 3.[38]
Python 2.7's end-of-life date was initially set at 2015 then postponed to 2020 out of concern that a large body of existing code could not easily be forward-ported to Python 3.[39][40] In January 2017, Google announced work on a Python 2.7 to Go transcompiler to improve performance under concurrent workloads.[41]
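
The same selector can be tried offline. The following is a minimal sketch, not the answer's original code: `SAMPLE_HTML` and the helper `section_texts` are hypothetical names introduced here, and the snippet assumes the headings are `<h2>` elements wrapping a `<span>` with a simple id that needs no CSS escaping. It uses the built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page mimicking the heading layout the selector assumes.
SAMPLE_HTML = """
<div>
  <h2><span id="Intro">Intro</span></h2>
  <p>Skipped paragraph.</p>
  <h2><span id="History">History</span></h2>
  <p>First history paragraph.</p>
  <p>Second history paragraph.</p>
  <h2><span id="Features_and_philosophy">Features and philosophy</span></h2>
  <p>Not part of the history section.</p>
</div>
"""


def section_texts(html, start_id, end_id):
    """Return the text of each element between two <h2> headings (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Siblings after the start heading that still have the end heading after them.
    selector = f"h2:has(span#{start_id}) ~ *:has(~ h2:has(span#{end_id}))"
    return [el.get_text(strip=True) for el in soup.select(selector)]


history = section_texts(SAMPLE_HTML, "History", "Features_and_philosophy")
print(history)
```

Only the two paragraphs between the headings should match, because the end heading and everything after it have no following `h2` that contains the end span.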

【Discussion】:

    【Solution 2】:

    You can't do it the way you'd like, because BS4 works on the DOM, which is a tree structure rather than a linear one.
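
To illustrate the point, here is a small sketch (the HTML snippet is a made-up stand-in that reuses the class names from the question). `soup.text` is a flat string, while `select_one` returns a `Tag` node from the tree, and `str.split` only accepts a string separator, so the question's approach raises a `TypeError`.

```python
from bs4 import BeautifulSoup

# Made-up document using the class names from the question.
html = (
    "<div class='first-header'>A</div>"
    "<p>body</p>"
    "<div class='second-header'>B</div>"
)
soup = BeautifulSoup(html, "html.parser")
tag_one = soup.select_one("div.first-header")

try:
    soup.text.split(tag_one)  # a Tag is not a str, so this cannot work
    split_failed = False
except TypeError as exc:
    split_failed = True
    print("split failed:", exc)
```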

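    Using your wiki example, what you actually need to do is:

    1. Find id="History" (this is a span).
    2. Navigate up to the H2 element; remember it as the start point.
    3. Find id="Features_and_philosophy" (this is another span).
    4. Navigate up to the nearest H2 element; remember it as the end point.

    Now, note that the two H2 elements are siblings (they share the same parent). So what you want is to take every sibling between the start H2 and the end H2 and get the full text of each one.

    It isn't hard, but it is a loop in which you compare each sibling until you reach the end point. It's just not as simple as you'd hoped.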
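
The four steps and the sibling loop could be sketched roughly as follows. This is one possible reading of the description, not code from the answer: `SAMPLE_HTML` and `text_between_headings` are hypothetical names, and the sketch assumes each id sits on a `<span>` inside its `<h2>` and that both headings share a parent.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page; ids sit on spans inside the h2 headings.
SAMPLE_HTML = """
<div>
  <h2><span id="History">History</span></h2>
  <p>First history paragraph.</p>
  <p>Second history paragraph.</p>
  <h2><span id="Features_and_philosophy">Features and philosophy</span></h2>
  <p>Not part of the history section.</p>
</div>
"""


def text_between_headings(html, start_id, end_id):
    """Collect the text of every sibling between two h2 headings (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Steps 1-2: find the start span, then climb to its h2.
    start_h2 = soup.find(id=start_id).find_parent("h2")
    # Steps 3-4: find the end span, then climb to its h2.
    end_h2 = soup.find(id=end_id).find_parent("h2")

    parts = []
    # Walk the tag siblings after the start heading until the end heading is reached.
    for sibling in start_h2.find_next_siblings():
        if sibling is end_h2:
            break
        parts.append(sibling.get_text(strip=True))
    return parts


section = text_between_headings(SAMPLE_HTML, "History", "Features_and_philosophy")
print(section)
```

The identity check `sibling is end_h2` works because both names refer to the same node object in the parsed tree.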

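    In the more general case it's considerably harder (or rather, more tedious), because you may have to go up and down the DOM tree looking for matching elements.

    【Discussion】: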
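
For that general case, one option (a swapped-in technique, not something the answer spells out) is to walk the document in parse order with `next_elements` instead of climbing the tree by hand. The sketch below is only an illustration: `NESTED_HTML` and `text_between` are hypothetical, the class names are taken from the question, and the end tag is assumed to appear after the start tag in the document.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical page where the two headers are NOT siblings.
NESTED_HTML = """
<div class="first-header">First</div>
<section><p>Alpha <b>beta</b></p></section>
<p>Gamma</p>
<div class="wrapper"><div class="second-header">Second</div><p>After</p></div>
"""


def text_between(start, end):
    """Collect non-empty text nodes that come after `start` and before `end`."""
    chunks = []
    for node in start.next_elements:  # document order, at any nesting depth
        if node is end:
            break
        if not isinstance(node, NavigableString):
            continue  # tags themselves carry no text; their strings follow
        if any(parent is start for parent in node.parents):
            continue  # skip the start tag's own text
        text = node.strip()
        if text:
            chunks.append(text)
    return chunks


soup = BeautifulSoup(NESTED_HTML, "html.parser")
between = text_between(
    soup.select_one("div.first-header"),
    soup.select_one("div.second-header"),
)
print(between)
```

Because the walk follows parse order, it does not matter whether the two headers share a parent; the trade-off is that the text arrives as individual string fragments rather than one block per element.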