【Question title】: Extract text inside HTML paragraph using BeautifulSoup in Python
【Posted】: 2015-02-22 06:44:05
【Question】:
<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>

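This is a paragraph I want to extract from an HTML page using BeautifulSoup in Python. I can get the values inside the tags with .children and .string, but I cannot get the text "Several new Point of Sale malware...", which sits directly in the paragraph without a tag of its own. I tried soup.p.text, .get_text(), and so on, but none of them worked.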

【Comments】:

    Tags: python html web-scraping beautifulsoup


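    【Solution 1】:
    import urllib.request
    from bs4 import BeautifulSoup
    
    # Example page to fetch; replace it with the page you want to scrape.
    url = "https://www.geeksforgeeks.org/how-to-automate-an-excel-sheet-in-python/?ref=feed"
    
    # Download the page; the response object can be passed straight to BeautifulSoup.
    html = urllib.request.urlopen(url)
    
    # Parse with the parser from the standard library.
    htmlParse = BeautifulSoup(html, 'html.parser')
    
    # Print the full text of every <p>, including text inside nested tags.
    for para in htmlParse.find_all("p"):
        print(para.get_text())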
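
For comparison, here is a minimal offline sketch of what get_text() returns for a shortened copy of the paragraph from the question. The markup is inlined so no network access is needed. The separator and strip=True arguments are a choice made for this illustration, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Shortened copy of the paragraph from the question, inlined for the example.
html = """
<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() collects every text node under <p>, nested tags included,
# so the untagged sentence comes back mixed with the <strong> and <em> text.
all_text = soup.p.get_text(" ", strip=True)
print(all_text)
```

The untagged sentence is present in the result, but so is the text of the nested tags. get_text() alone therefore cannot isolate the sentence.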
    

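    【Comments】:

    • While this code may answer the question, providing additional context about why and/or how it answers the question improves its long-term value.
    【Solution 2】: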

    Use find_all() with text=True to find all text nodes, and recursive=False to search only among the direct children of the parent p tag:

    from bs4 import BeautifulSoup
    
    data = """
    <p>
        <a name="533660373"></a>
        <strong>Title: Point of Sale Threats Proliferate</strong><br />
        <strong>Severity: Normal Severity</strong><br />
        <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
        Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
        <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
        <br />
    </p>
    """
    
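    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(data, "html.parser")
    # print() is a function in Python 3; bs4 4.4.0+ also accepts string=True in place of text=True.
    print(''.join(text.strip() for text in soup.p.find_all(text=True, recursive=False)))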
    

    Prints:

    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..
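
The same idea can be sketched as a small reusable helper that walks the direct children of a tag and keeps only the bare strings. The function name direct_text is hypothetical and chosen only for this illustration. The sample markup is a shortened copy of the paragraph from the question.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def direct_text(tag):
    """Join the non-empty text nodes that are direct children of `tag`.

    Hypothetical helper for illustration; not part of the bs4 API.
    """
    parts = []
    for child in tag.children:
        # Tag objects (<strong>, <em>, <br/>) are skipped; only string nodes are kept.
        if isinstance(child, NavigableString):
            stripped = child.strip()
            if stripped:
                parts.append(stripped)
    return " ".join(parts)


html = """
<p>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
</p>
"""

soup = BeautifulSoup(html, "html.parser")
result = direct_text(soup.p)
print(result)
```

Joining with a space rather than an empty string keeps separate text fragments from running together when a paragraph contains more than one untagged piece.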
    

    【Comments】:
