How to get HTML text inside a tag using BeautifulSoup: answers

【Question title】: How to get HTML text inside a tag using BeautifulSoup
【Posted】: 2021-12-23 00:31:44
【Question】:

How can I extract data from this sample HTML with BeautifulSoup?

<Tag1>
    <message code="able to extract text from here"/>
    <text value="able to extract text that is here"/>
    <htmlText>&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;get exact data from here&lt;/p&gt;]]&gt;</htmlText>
</Tag1>

I tried .find_all and .get_text, but I could not extract the text value from the htmlText element.

Expected output:

some thing ORget exact data from here

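【Comments】:

Tags: python html python-3.x beautifulsoup

【Solution 1】:

You can use BeautifulSoup twice: first to extract the htmlText element, then to parse its contents. The markup inside htmlText is entity-escaped, so the first parse sees it only as plain text. For example: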

from bs4 import BeautifulSoup
import lxml  # optional: bs4 loads the parser by name; this only fails early if lxml is missing

html = """
<Tag1>
    <message code="able to extract text from here"/>
    <text value="able to extract text that is here"/>
    <htmlText>&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;get exact data from here&lt;/p&gt;]]&gt;</htmlText>
</Tag1>
"""
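# First pass: parse the outer document. The lxml HTML parser lowercases
# tag names, hence "tag1" and .htmltext below.
soup = BeautifulSoup(html, "lxml")

for tag1 in soup.find_all("tag1"):
    # .text returns the unescaped content, i.e. the inner markup as a string
    cdata_html = tag1.htmltext.text
    # Second pass: parse that string as HTML in its own right
    cdata_soup = BeautifulSoup(cdata_html, "lxml")

    print(cdata_soup.p.text)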

This prints:

some thing ORget exact data from here

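Note: lxml must also be installed, with pip install lxml. BeautifulSoup loads it automatically when "lxml" is passed as the parser, so no explicit import lxml is needed.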
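
If installing lxml is not an option, the same two-pass idea can be sketched with Python's built-in "html.parser". This is a minimal sketch and not part of the original answer: the function name extract_html_text and the regular expression that strips the CDATA markers are assumptions added for illustration. Stripping the markers explicitly avoids depending on how a given parser treats a literal <![CDATA[ in HTML, and calling get_text() on the whole inner document avoids an AttributeError when the fragment contains no <p>.

```python
import re

from bs4 import BeautifulSoup

SAMPLE = """
<Tag1>
    <message code="able to extract text from here"/>
    <text value="able to extract text that is here"/>
    <htmlText>&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;get exact data from here&lt;/p&gt;]]&gt;</htmlText>
</Tag1>
"""

# Matches a leading "<![CDATA[" or a trailing "]]>" (hypothetical helper pattern).
CDATA_MARKERS = re.compile(r"^\s*<!\[CDATA\[|\]\]>\s*$")


def extract_html_text(markup, parser="html.parser"):
    """Return the visible text of every <htmlText> element in markup."""
    outer = BeautifulSoup(markup, parser)
    texts = []
    # HTML parsers lowercase tag names, so search for "htmltext".
    for node in outer.find_all("htmltext"):
        # First pass result: the unescaped inner markup as a plain string.
        inner_markup = CDATA_MARKERS.sub("", node.get_text())
        # Second pass: parse that string and collect its text.
        inner = BeautifulSoup(inner_markup, parser)
        texts.append(inner.get_text())
    return texts


print(extract_html_text(SAMPLE))
```

Passing parser="lxml" should work the same way if lxml is installed, since the CDATA markers are removed before the second parse.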

【Comments】:

【Solution 2】:

These are the steps to follow:

# first, select all <htmlText> elements
soup.select("htmlText")


# second, iterate over them
for result in soup.select("htmlText"):
    pass  # further code goes here


# third, parse each element's text with another BeautifulSoup() object;
# otherwise the <p> and <lite> elements are unreachable, because the
# first BeautifulSoup() object sees them only as escaped text
for result in soup.select("htmlText"):
    final = BeautifulSoup(result.text, "lxml")


# fourth, take the first <p> element and its text with .p.text
for result in soup.select("htmlText"):
    final = BeautifulSoup(result.text, "lxml").p.text
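
To see why the first BeautifulSoup() object cannot reach <p> and <lite>, it may help to look at what the htmlText element actually holds. The sketch below uses only the standard library's html.unescape; the variable names are illustrative and do not come from the original answer.

```python
import html

# The raw content of <htmlText>, copied from the sample document.
escaped = (
    "&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;"
    "get exact data from here&lt;/p&gt;]]&gt;"
)

# As written, the string contains no real tags, only entities such as
# &lt; and &gt;, so a parser treats all of it as one piece of text.
print("<p>" in escaped)  # False

# The parser decodes these entities, so .text returns a string like this one.
unescaped = html.unescape(escaped)
print(unescaped)
# <![CDATA[<p>some thing <lite>OR</lite>get exact data from here</p>]]>
print("<p>" in unescaped)  # True
```

That unescaped string is what the second BeautifulSoup() call receives.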

Code and example in the online IDE (use whichever variant you find most readable):

from bs4 import BeautifulSoup
import lxml  # optional: bs4 loads the parser by name; this only fails early if lxml is missing

html = """
<Tag1>
    <message code="able to extract text from here"/>
    <text value="able to extract text that is here"/>
    <htmlText>&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;get exact data from here&lt;/p&gt;]]&gt;</htmlText>
</Tag1>
"""

soup = BeautifulSoup(html, "lxml")


# BeautifulSoup inside BeautifulSoup
unreadable_soup = BeautifulSoup(BeautifulSoup(html, "lxml").select_one('htmlText').text, "lxml").p.text
print(unreadable_soup)


example_1 = BeautifulSoup(soup.select_one('htmlText').text, "lxml").p.text
print(example_1)


# without hardcoded list slices
for result in soup.select("htmlText"):
    example_2 = BeautifulSoup(result.text, "lxml").p.text
    print(example_2)


# or as a one-liner
example_3 = ''.join([BeautifulSoup(result.text, "lxml").p.text for result in soup.select("htmlText")])
print(example_3)


# output
'''
some thing ORget exact data from here
some thing ORget exact data from here
some thing ORget exact data from here
some thing ORget exact data from here
'''

【Comments】: