Finding end tag content in HTML with BeautifulSoup: Answers

【Question Title】: Finding end tag content in HTML with BeautifulSoup
【Posted】: 2023-03-16 16:42:01
【Question Description】:

I am using BeautifulSoup with Python34 on a Windows 7 machine. I have the following content that I am trying to parse:

<bound method Tag.find of <div class="accordion">
<p> <span style="color:039; font-size:14px; font-weight:bold">Acetohydroxamic Acid (Lithostat) Tablets</span><br/><br/>



  <strong>Status: Currently in Shortage </strong><br/><br/>



         » <strong>Date first posted</strong>: 

        07/15/2014<br/>



 » <strong>Therapeutic Categories</strong>: Renal<br/>
</p><p style="padding:10px;">
</p>
<h3>

    Mission Pharmacal  (<em>Reverified  01/21/2015</em>)

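I am trying to pull out the "07/15/2014" that follows "Date first posted". I also need to pull out "Renal". I can find all of the strong tags with .findAll("strong"), but I cannot work out how to get the text that comes after the </strong>: and before the next <br/>.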

【Question Discussion】:

Tags: python python-3.x beautifulsoup

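【Solution 1】:

You need to use .next_sibling to get the node that follows each strong tag, isinstance(el, bs4.Tag) to skip siblings that are themselves Tag elements (so only text nodes remain), and finally re.sub to strip the whitespace and the colon: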

In [38]: import re

In [39]: import bs4

In [40]: from bs4 import BeautifulSoup

In [41]: soup = BeautifulSoup("""<bound method Tag.find of <div class="accordion">
   ....: <p> <span style="color:039; font-size:14px; font-weight:bold">Acetohydroxamic Acid (Lithostat) Tablets</span><br/><br/>
   ....: 
   ....: 
   ....: 
   ....:   <strong>Status: Currently in Shortage </strong><br/><br/>
   ....: 
   ....: 
   ....: 
   ....:         » <strong>Date first posted</strong>: 
   ....: 
   ....:                07/15/2014<br/>
   ....: 
   ....:     
   ....: 
   ....:  » <strong>Therapeutic Categories</strong>: Renal<br/>
   ....: </p><p style="padding:10px;">
   ....: </p>
   ....: <h3>
   ....: 
   ....:        Mission Pharmacal  (<em>Reverified  01/21/2015</em>)""")

In [42]: for strong_tag in soup.find_all('strong'):
   ....:     if not isinstance(strong_tag.next_sibling, bs4.Tag):
   ....:         print(re.sub(r'[:\s]+', '', strong_tag.next_sibling))
   ....:         
07/15/2014
Renal
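
To make the role of the isinstance check more concrete, here is a minimal sketch that only reports what kind of node follows each strong tag. It assumes a shortened copy of the question's markup and an explicit "html.parser" argument; the names html and sibling_kinds are introduced here for illustration and are not part of the original answer.

```python
import bs4
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (an assumption for this sketch).
html = """<p><strong>Status: Currently in Shortage </strong><br/>
» <strong>Date first posted</strong>: 07/15/2014<br/>
» <strong>Therapeutic Categories</strong>: Renal<br/></p>"""

soup = BeautifulSoup(html, "html.parser")

sibling_kinds = []
for strong_tag in soup.find_all("strong"):
    sibling = strong_tag.next_sibling
    # A <br/> right after </strong> is a Tag; loose text is a NavigableString.
    kind = "Tag" if isinstance(sibling, bs4.Tag) else "text"
    sibling_kinds.append((strong_tag.get_text().strip(), kind))
    print(strong_tag.get_text().strip(), "->", kind, repr(sibling))
```

The "Status" tag is followed directly by a <br/> element, which is why the loop in the answer skips it and prints only the date and the category.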

Edit

Is there any way to get that date without using a loop?

Yes, you can pass the text argument to find:

re.sub(r'[:\s]+', '', soup.find('strong', text=re.compile('Date')).next_sibling)

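【Discussion】:

Works great. I am trying to explore the documentation with the IDLE shell and learn BeautifulSoup, specifically .next_sibling and the use of "Tag" above. Is there any way to get that date without using a loop?
@jer99 See my edit. Also feel free to accept the answer if it helped.
I am still struggling with this, working with next_sibling and find. I am now trying to get "Mission Pharmacal" as a string and to get the text between the tags. soup.find("h3") gives me the whole string with the tags, so I tried re.sub('em','',soup.find("h3")) and it tells me there is a TypeError - expected string or buffer. I seem to be lost.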
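
On the TypeError in the last comment: soup.find("h3") returns a bs4 Tag object rather than a str, and re.sub only accepts strings. One possible way around it, sketched below, is to read the text nodes from the Tag directly. The closing </h3> is added here as an assumption (the snippet in the question is cut off), and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# The <h3> fragment from the question, closed by hand for this sketch.
html = """<h3>

    Mission Pharmacal  (<em>Reverified  01/21/2015</em>)</h3>"""

soup = BeautifulSoup(html, "html.parser")
h3 = soup.find("h3")  # a Tag, not a str

# The first child of <h3> is the text node that precedes <em>.
first_text = h3.contents[0]
company = first_text.strip().rstrip("(").strip()

# get_text() returns the text inside a tag with the markup removed;
# split/join collapses the runs of whitespace.
reverified = " ".join(h3.em.get_text().split())
full_line = " ".join(h3.get_text().split())

print(company)
print(reverified)
print(full_line)
```

If a regular expression is still wanted, converting the Tag with str(h3) or h3.get_text() first gives re.sub the string it expects.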

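【Solution 2】:

Why not use the regular expression (?<=/strong>:)([^<]+)? The ?<= in the first group marks it as a positive lookbehind, which means "find this string, but do not capture it". The second group means "match one or more characters other than <". Finally, strip removes any extra whitespace around the captured group.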

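import re
import requests

# `url` must hold the address of the page being scraped (not given in this answer).
s = requests.get(url).text
# The lookbehind anchors on "/strong>:"; the group captures the text up to the next "<".
matches = [l.strip() for l in re.findall(r'(?<=/strong>:)([^<]+)', s)]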
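
As a small illustration of how re.findall behaves with this pattern, the sketch below runs it against an inline copy of the relevant lines instead of a downloaded page. The html string and the variable names are assumptions made for the example. When a pattern contains exactly one capturing group, findall returns a list of the strings matched by that group.

```python
import re

# Inline copy of the relevant lines from the question (assumed sample input).
html = """» <strong>Date first posted</strong>: 

        07/15/2014<br/>
 » <strong>Therapeutic Categories</strong>: Renal<br/>"""

# (?<=/strong>:) requires "/strong>:" just before the match without consuming it;
# ([^<]+) captures everything up to the next "<".
pattern = re.compile(r"(?<=/strong>:)([^<]+)")

raw = pattern.findall(html)        # one string per match, whitespace included
values = [m.strip() for m in raw]  # trim the surrounding spaces and newlines
print(values)
```

A strong tag that is not followed by a colon, such as the "Status" line in the question, is not matched, because the lookbehind requires "/strong>:" immediately before the captured text.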

【Discussion】:

Would you mind explaining that? I am new to Python - I do not quite understand the syntax of findall...