Filtering out based on tag value

【Question title】: Filtering out based on tag value
【Posted】: 2022-01-16 19:47:53
【Question】:

I am doing some web scraping with BeautifulSoup, and part of the result looks like this:

...
<tr>
    <th> class = "[whatever]" <a href = '[link 1]'> </a></th>
    ...
    ...
    <td> class = "[whatever]" <a href = '[link 2]'> </a></td>
</tr>,
<tr>
    <th> class = "[whatever]" <a href = '[link 1]'> </a></th>
    ...
    ...
    <td> class = "[whatever]" <a href = '[link 3]'> </a></td>
</tr>,
<tr>
    <th> class = "[whatever]" <a href = '[link 1]'> </a></th>
    ...
    ...
    <td> class = "[whatever]" </td>
</tr>,
...

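The three tr blocks are structurally identical, except that the third block has no "a href = [something]" tag at the end. How can I filter out that last block? I tried doing it based on length, but that does not seem to work.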

Edit: my expected result looks like this:

...
<tr>
    <th> class = "[whatever]" <a href = '[link 1]'> </a></th>
    ...
    ...
    <td> class = "[whatever]" <a href = '[link 2]'> </a></td>
</tr>,
<tr>
    <th> class = "[whatever]" <a href = '[link 1]'> </a></th>
    ...
    ...
    <td> class = "[whatever]" <a href = '[link 3]'> </a></td>
</tr>,
...

【Comments】:

Python's html parser?
Try parsing on the row or cell tags (tr, td, etc.), then check whether an href exists and work from that.
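"I tried doing it based on length": good, but share the code you tried and someone will be more likely to provide a correct answer.
What is the expected output? The href, or everything else besides it?
Apologies for the confusion. I have edited the post to include my desired output.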
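The suggestion in the comments (walk the rows, then check whether a link exists in the data cells) could be sketched roughly as follows. This is only an illustration under assumptions: the sample markup, the `html` variable and the helper name `rows_with_linked_cell` are hypothetical, and the markup is a cleaned-up stand-in for the structure shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mirrors the structure described in the question.
html = """
<table>
<tr><th class="x"><a href="link1"></a></th><td class="x"><a href="link2"></a></td></tr>
<tr><th class="x"><a href="link1"></a></th><td class="x"><a href="link3"></a></td></tr>
<tr><th class="x"><a href="link1"></a></th><td class="x"></td></tr>
</table>
"""


def rows_with_linked_cell(markup):
    """Return the <tr> elements whose <td> cells contain an <a href=...>."""
    soup = BeautifulSoup(markup, "html.parser")
    kept = []
    for row in soup.find_all("tr"):
        # Only the <td> cells are checked; the <th> link exists in every row.
        cells = row.find_all("td")
        if any(td.find("a", href=True) is not None for td in cells):
            kept.append(row)
    return kept


rows = rows_with_linked_cell(html)
print(len(rows))
```

Checking for the link directly, rather than comparing lengths, keeps the test independent of how many other cells or text nodes a row happens to contain.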

from bs4 import BeautifulSoup

text = """
<tr>
    <th> class = "[whatever]" <a href = '[link 1]'> </a></th>
    ...
    ...
    <td> class = "[whatever]" <a href = '[link 2]'> </a></td>
</tr>,
<tr>
    <th> class = "[whatever]" <a href = '[link 1]'> </a></th>
    ...
    ...
    <td> class = "[whatever]" <a href = '[link 3]'> </a></td>
</tr>,
<tr>
    <th> class = "[whatever]" <a href = '[link 1]'> </a></th>
    ...
    ...
    <td> class = "[whatever]" </td>
</tr>
"""

soup = BeautifulSoup(text, features='lxml')

# Remove every th/td cell that has no direct <a> child.
# Note: this drops the individual cell, not the enclosing <tr>.
for item in soup.find_all(["th", "td"]):
    if len([c for c in item.children if c.name == 'a']) == 0:
        item.decompose()

print(soup.prettify())

from bs4 import BeautifulSoup

text = """
<tr>
    <th> class = "[whatever]" <a href = '[link 1]'> </a></th>
    ...
    ...
    <td> class = "[whatever]" <a href = '[link 2]'> </a></td>
</tr>,
<tr>
    <th> class = "[whatever]" <a href = '[link 1]'> </a></th>
    ...
    ...
    <td> class = "[whatever]" <a href = '[link 3]'> </a></td>
</tr>,
<tr>
    <th> class = "[whatever]" <a href = '[link 1]'> </a></th>
    ...
    ...
    <td> class = "[whatever]" </td>
</tr>
"""

soup = BeautifulSoup(text, "html.parser")

# Select only the rows that have an <a> directly inside a <td>.
soup.select('tr:has(> td > a)')
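If the goal is to keep working with the parsed document rather than a list of matches, one option is to delete the unwanted rows in place. The following is a minimal sketch, assuming well-formed markup without nested tables; the sample string `sample` and the helper name `drop_rows_without_link` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, cleaned-up sample with the same shape as the question's rows.
sample = """
<table>
<tr><th class="x"><a href="link1"></a></th><td class="x"><a href="link2"></a></td></tr>
<tr><th class="x"><a href="link1"></a></th><td class="x"><a href="link3"></a></td></tr>
<tr><th class="x"><a href="link1"></a></th><td class="x"></td></tr>
</table>
"""


def drop_rows_without_link(markup):
    """Parse the markup and remove each <tr> that has no <a> directly in a <td>."""
    soup = BeautifulSoup(markup, "html.parser")
    # find_all returns a list, so removing rows while looping over it is safe.
    for row in soup.find_all("tr"):
        if row.select_one("td > a") is None:
            row.decompose()  # removes the whole row, not just one cell
    return soup


cleaned = drop_rows_without_link(sample)
print(cleaned.prettify())
```

Because `decompose()` is called on the row itself, the third block disappears entirely, which matches the expected result shown in the question.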