BeautifulSoup with multiple tags, each tag with a specific class

【Question title】: BeautifulSoup with multiple tags, each tag with a specific class
【Posted】: 2017-03-20 13:00:04
【Question description】:

I am trying to use BeautifulSoup to parse a table from a website. (I cannot share the site's source code because its use is restricted.)

I want to extract data only when it sits in the following two tags with these specific classes:

td, width=40%
tr, valign=top

The reason is that I only want data that has both of these tags and classes.

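I found a discussion about using multiple tags here, but it only covers tags, not classes. I did try to extend the code with the same list-based logic, but I don't think I am getting what I want: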

 my_soup=soup.find_all(['td',{"width":"40%"},'tr',{'valign':'top'}])

In short, my question is how to use multiple tags in find_all, each with its own specific class, so that the result is the "and" of the two tags.

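【Question discussion】:

Did you ever solve this?
I just put a bounty on this. Rather than requiring both tags as the OP wanted, I am interested in whether anyone can share a solution based on soup.findall() that finds every tag that is either td or tr and carries its required attribute, if that makes sense.
As stated in the bounty, I am interested in preserving the order of the matches.
After a long search, I found the answer here: *.com/a/40305890/5874001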

Tags: python html tags beautifulsoup findall

【Solution 1】:

You can use re.compile objects with soup.find_all:

import re
from bs4 import BeautifulSoup as soup
html = """
  <table>
    <tr style='width:40%'>
      <td style='align:top'></td>
    </tr>
  </table>
"""
results = soup(html, 'html.parser').find_all(re.compile('td|tr'), {'style':re.compile('width:40%|align:top')})

Output:

[<tr style="width:40%">
   <td style="align:top"></td>
 </tr>, <td style="align:top"></td>]

By supplying re.compile objects for the tag name and the style value, find_all returns every tr or td tag whose inline style attribute contains width:40% or align:top.
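
Note that the regular expressions above accept either tag with either style value, so they do not tie a particular attribute to a particular tag. A minimal sketch of one way to pair them, assuming the width and valign attributes from the question: find_all also accepts a function that receives each tag and returns True or False. The sample markup, the WANTED mapping, and the wanted helper are hypothetical names made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, using the width/valign attributes from the question.
html = """
<table>
  <tr valign="top"><td width="40%">a</td><td valign="top">b</td></tr>
  <tr width="40%"><td width="40%">c</td></tr>
</table>
"""

# Hypothetical mapping: tag name -> attributes that this tag must carry.
WANTED = {"td": {"width": "40%"}, "tr": {"valign": "top"}}

def wanted(tag):
    """Return True only if the tag has every attribute required for its own name."""
    required = WANTED.get(tag.name)
    if required is None:
        return False
    return all(tag.get(attr) == value for attr, value in required.items())

doc = BeautifulSoup(html, "html.parser")
matches = doc.find_all(wanted)  # results come back in document order
summary = [(t.name, t.get("width"), t.get("valign")) for t in matches]
print(summary)
```

Here the td carrying valign="top" and the tr carrying width="40%" are skipped, because each tag name is checked only against its own required attributes.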

This approach can be extended to find elements by several attribute values at once:

html = """
 <table>
   <tr style='width:40%'>
    <td style='align:top' class='get_this'></td>
    <td style='align:top' class='ignore_this'></td>
  </tr>
</table>
"""
results = soup(html, 'html.parser').find_all(re.compile('td|tr'), {'style':re.compile('width:40%|align:top'), 'class':'get_this'})

Output:

[<td class="get_this" style="align:top"></td>]
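
One caveat with re.compile('td|tr') as the tag name: Beautiful Soup applies a regular expression filter with its search() method, so the pattern also matches any tag name that merely contains "td" or "tr", such as strong. A small sketch of the difference, assuming a made-up snippet of markup; passing a list of names is one way to match the names exactly.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: a <strong> tag that happens to carry a matching style.
html = (
    '<p><strong style="width:40%">x</strong></p>'
    '<table><tr><td style="align:top">y</td></tr></table>'
)
doc = BeautifulSoup(html, "html.parser")
style = re.compile(r"width:40%|align:top")

# Regex name filter: "tr" is found inside "strong", so <strong> is returned too.
loose = [t.name for t in doc.find_all(re.compile("td|tr"), {"style": style})]

# List name filter: only tags named exactly "td" or "tr" are considered.
strict = [t.name for t in doc.find_all(["td", "tr"], {"style": style})]

print(loose, strict)
```

An anchored pattern such as re.compile(r'^(td|tr)$') should behave like the list, if a regular expression is preferred.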

Edit 2: a simple recursive solution:

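import bs4
from bs4 import BeautifulSoup as soup

def get_tags(d, params):
    # Yield d if any attribute listed for its tag name matches
    # ('class' is a list in bs4, so test membership; other attributes compare directly).
    if any((lambda x: b in x if a == 'class' else b == x)(d.attrs.get(a, [])) for a, b in params.get(d.name, {}).items()):
        yield d
    # Recurse into child tags only, skipping text nodes.
    for i in filter(lambda x: x != '\n' and not isinstance(x, bs4.element.NavigableString), d.contents):
        yield from get_tags(i, params)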

html = """
 <table>
  <tr style='align:top'>
    <td style='width:40%'></td>
    <td style='align:top' class='ignore_this'></td>
 </tr>
 </table>
"""
print(list(get_tags(soup(html, 'html.parser'), {'td':{'style':'width:40%'}, 'tr':{'style':'align:top'}})))

Output:

[<tr style="align:top">
  <td style="width:40%"></td>
  <td class="ignore_this" style="align:top"></td>
 </tr>, <td style="width:40%"></td>]

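The recursive function lets you supply your own dictionary of the target attributes required for each tag. It tries to match any of the specified attributes against the bs4 object passed to it, and yields the element if a match is found.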
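
For comparison, a CSS selector group may express the same per-tag pairing without recursion. This is only a sketch, assuming a bs4 version whose select() is backed by the soupsieve package, and using the width and valign attributes from the question on made-up markup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = """
<table>
  <tr valign="top"><td width="40%">a</td><td valign="top">b</td></tr>
  <tr width="40%"><td>c</td></tr>
</table>
"""
doc = BeautifulSoup(html, "html.parser")

# A selector group: each selector pairs one tag name with its own attribute value.
selected = doc.select('td[width="40%"], tr[valign="top"]')
names = [t.name for t in selected]
print(names)
```

The td with valign="top" and the tr with width="40%" are not selected, since each attribute test is attached to a single tag name.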

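【Discussion】:

What if you are interested in several attributes other than style? What if you want to filter on style, id, and class?
@InfiniteFlashChess soup.find_all tries to match every attribute supplied. However, I wrote a simple recursive function that provides what you want; see Edit 2.
Sorry, I deleted my previous comment because, while trying to write out exactly what I wanted, I realized it didn't fit. Let me try to restate. @Ajax1234
I want td tags with the attribute "width":"40%" and tr tags with the attribute 'valign':'top'. I don't want td tags with 'valign':'top', nor tr tags with "width":"40%". That is how I originally read the OP. Sorry again for making you spend time on the latest edit. @Ajax1234 please let me know whether this makes sense. Basically, I am trying to match two different bs4 elements that have two different tags and attributes.
The reason I don't want to use two soup.findall() statements is that the tr tag is not nested in the td tag. As the OP implies, they sit at the same level of the hierarchy. @Ajax1234 let me take another look at the recursive function, thanks. I did find an alternative to your solution, but the least I can do is check whether yours works. Also, I edited and reposted a lot. I don't know why they downvoted. Probably some jealous fool (I upvoted it).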

【Solution 2】:

Assuming bsObj is your BeautifulSoup object, try:

tr = bsObj.findAll('tr', {'valign': 'top'})
td = tr.findAll('td', {'width': '40%'})

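Hope this helps.

【Discussion】:

I don't think this works, but I may be missing something. The first line returns a ResultSet, and calling find_all on that ResultSet in the second line raises an error saying that ResultSet has no find_all method. I'm using bs4.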
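
For what it's worth, a minimal sketch of how the idea in this answer could be made to run: a ResultSet behaves like a list of tags, so find_all has to be called on each tr it contains rather than on the ResultSet itself. The sample markup is hypothetical, and this variant only finds td cells nested inside a matching tr, which may not be what every reader of the question wants.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = """
<table>
  <tr valign="top"><td width="40%">a</td><td width="60%">b</td></tr>
  <tr><td width="40%">c</td></tr>
</table>
"""
bsObj = BeautifulSoup(html, "html.parser")

# First pass: every row with valign="top" (this is a ResultSet of Tag objects).
rows = bsObj.find_all("tr", {"valign": "top"})

# Second pass: search inside each row, not on the ResultSet as a whole.
cells = [td for tr in rows for td in tr.find_all("td", {"width": "40%"})]
texts = [td.get_text() for td in cells]
print(texts)
```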