Extract text inside an HTML paragraph using BeautifulSoup in Python: answers

【Question title】: Extract text inside HTML paragraph using BeautifulSoup in Python
【Posted】: 2015-02-22 06:44:05
【Question】:

<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>

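This is a paragraph I want to extract from an HTML page using BeautifulSoup in Python. I can get the values inside the tags with .children and .string, but I cannot get the text "Several new Point of Sale malware...", which sits directly in the paragraph with no tag of its own. I tried soup.p.text, .get_text() and so on, but had no luck.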

【Comments】:

Tags: python html web-scraping beautifulsoup

【Solution 1】:

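import urllib.request
from bs4 import BeautifulSoup

# Example page used by this answer; it is not the page from the question.
url = "https://www.geeksforgeeks.org/how-to-automate-an-excel-sheet-in-python/?ref=feed"

# Download the page.
html = urllib.request.urlopen(url)

# Parse it with the parser bundled with Python.
htmlParse = BeautifulSoup(html, 'html.parser')

# Print the full text of every <p> element, nested tags included.
for para in htmlParse.find_all("p"):
    print(para.get_text())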

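【Comments】:

While this code may answer the question, providing additional context on why and/or how it answers the question improves its long-term value.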
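
As one possible piece of that context: get_text() collects every text node under a tag, including text inside nested tags. Applied to the paragraph from the question, it should therefore return the untagged sentence together with the text inside the strong and em tags. Below is a minimal sketch, assuming the question's HTML is available as a string; the names html and lines are illustrative, and the separator and strip arguments are used only to make the output readable.

```python
from bs4 import BeautifulSoup

# Sample paragraph copied from the question.
html = """
<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() gathers all text under <p>, nested tags included.
# strip=True trims each piece and drops empty ones; separator joins them.
lines = soup.p.get_text(separator="\n", strip=True).split("\n")

for line in lines:
    print(line)
```

Because the result mixes tagged and untagged text, this approach fits when the whole paragraph is wanted. It does not isolate the untagged sentence on its own.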

【Solution 2】:

Use find_all() with text=True to find all text nodes, and recursive=False to search only the direct children of the parent p tag:

from bs4 import BeautifulSoup

data = """
<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>
"""

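# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")
# text=True matches text nodes only (newer bs4 releases call this argument string).
print(''.join(text.strip() for text in soup.p.find_all(text=True, recursive=False)))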

Prints:

Several new Point of Sale malware families have emerged recently, to include LusyPOS,..
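
The same idea can also be sketched without find_all(): walk the direct children of the p tag and keep only the NavigableString nodes, which is how BeautifulSoup represents text that is not wrapped in a tag. This is an alternative illustration, not part of the original answer. The helper name direct_text is hypothetical, and the sample is a shortened copy of the question's paragraph.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the paragraph from the question.
html = """
<p>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>
"""


def direct_text(tag):
    """Return the non-blank text nodes that are direct children of tag.

    Hypothetical helper for illustration. HTML comments are also
    NavigableString instances, so they would be included if present.
    """
    pieces = []
    for child in tag.children:
        # Tags (<strong>, <em>, <br>) are skipped; only bare text is kept.
        if isinstance(child, NavigableString):
            stripped = child.strip()
            if stripped:
                pieces.append(stripped)
    return pieces


soup = BeautifulSoup(html, "html.parser")
bare_text = direct_text(soup.p)
print(bare_text)
```

Returning a list rather than a joined string keeps separate pieces apart when a paragraph contains more than one untagged run of text.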

【Comments】: