How to get value from td in BeautifulSoup? Answers

【Question Title】: How to get value from td in BeautifulSoup?
【Posted】: 2021-07-28 08:32:41
【Question Description】:

I have a page that contains some tables:

<td class="ng-binding">9:20 AM</td>,
 <td class="ng-binding"><span class="ng-binding" ng-bind-html="objFlight.flight.statusMessage.text | unsafe">Scheduled</span> </td>,
 <td class="ng-binding">1:05 PM</td>,
 <td class="ng-binding"><span class="ng-binding" ng-bind-html="objFlight.flight.statusMessage.text | unsafe">Scheduled</span> </td>,
 <td class="ng-binding">1:15 PM</td>,
 <td class="ng-binding"><span class="ng-binding" ng-bind-html="objFlight.flight.statusMessage.text | unsafe">Scheduled</span> </td>,
 <td class="ng-binding">9:20 AM</td>,
 <td class="ng-binding"><span class="ng-binding" ng-bind-html="objFlight.flight.statusMessage.text | unsafe">Scheduled</span> </td>,
 <td class="ng-binding" colspan="7">* All times are in local timezone</td>

I want to get the times from this page:

9:20 AM
1:05 PM
1:15 PM
9:20 AM

However, my code prints every cell, not just the times:

times=soup.find_all('td',{'class':'ng-binding'})
for time in times:
    a = time.text.strip()
    print(a)

--------------------------------------------------------

9:20 AM
Scheduled
1:05 PM
Scheduled
1:15 PM
Scheduled
9:20 AM
Scheduled
* All times are in local timezone

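How can I fix this and get the expected output from the page? Thanks.

【Question Comments】:

Tags: python html beautifulsoup html-table html-parsing

【Solution 1】:

If the markup is as shown (though I suspect you may need to add some kind of anchor to the selector), you can use nth-child(odd) and then filter out the td that has a colspan attribute: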

[i.text for i in soup.select('td:nth-child(odd):not([colspan])')]

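Without seeing more of the HTML, and regarding your follow-up comment, you could use .endswith to filter the list you currently get (I am not sure how reliable this is given the limited HTML):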

[i.text for i in soup.select('td:nth-child(odd):not([colspan])') if i.text.endswith((' AM', ' PM'))]
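
The `.endswith` filter can be tried on plain strings without any HTML, since `str.endswith` accepts a tuple of suffixes. The sketch below is only an illustration: the list reuses the cell texts printed in the question, and `keep_times` is a hypothetical helper name. It also strips each text first, on the assumption that real cells may carry trailing whitespace (as `'Scheduled '` does in the comments below), which would otherwise make the suffix test fail.

```python
# Cell texts as printed in the question's output.
cell_texts = [
    "9:20 AM", "Scheduled",
    "1:05 PM", "Scheduled",
    "1:15 PM", "Scheduled",
    "9:20 AM", "Scheduled",
    "* All times are in local timezone",
]


def keep_times(texts):
    """Keep only texts that end in ' AM' or ' PM' after stripping whitespace.

    Hypothetical helper: str.endswith accepts a tuple, so both suffixes
    are checked in a single call.
    """
    stripped = (text.strip() for text in texts)
    return [text for text in stripped if text.endswith((" AM", " PM"))]


times = keep_times(cell_texts)
print(times)
```

This only checks the suffix, so any other cell whose text happens to end in " AM" or " PM" would pass as well.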

【Comments】:

I tried that, but I got ['1:05 PM', 'Denpasar (DPS)-', 'AT7 () ', 'Scheduled ', '1:15 PM', 'Kupang (KOE)-', 'AT7 () ', 'Scheduled ', '9:20 AM', 'Kupang (KOE)-', 'AT7 () ', 'Scheduled ', '1:15 PM', 'Kupang (KOE)-', 'AT7 () ', 'Scheduled ']
Then your HTML is not just what is shown; please include the entire HTML for that table.
Less robust would be [i.text for i in soup.select('td') if i.text.endswith((' AM', ' PM'))]
Or combined: [i.text for i in soup.select('td:nth-child(odd):not([colspan])') if i.text.endswith((' AM', ' PM'))]

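【Solution 2】:

A more integrated approach is to apply the condition while fetching the tags, which can be done in at least two ways. In both, we replace the tag name passed to find_all with a function that applies the extra condition:

Filter out td tags that contain a span: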

def is_td_without_span(tag):
    return tag.name == "td" and not tag.find("span")

times = soup.find_all(is_td_without_span,{'class':'ng-binding'})
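
As a rough check of the span-based filter, the sketch below runs it against a small hypothetical sample modeled on the cells in the question (the real page is likely to contain more markup). It suggests that the footnote cell has no span either and so is still returned; the second function, `is_plain_time_cell`, is an assumed extension rather than part of the answer, and it also skips cells that carry a colspan attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the question; the real page may differ.
SAMPLE_HTML = """
<table><tr>
<td class="ng-binding">9:20 AM</td>
<td class="ng-binding"><span class="ng-binding">Scheduled</span> </td>
<td class="ng-binding">1:05 PM</td>
<td class="ng-binding"><span class="ng-binding">Scheduled</span> </td>
<td class="ng-binding" colspan="7">* All times are in local timezone</td>
</tr></table>
"""


def is_td_without_span(tag):
    # Same test as in the answer: a td with no span descendant.
    return tag.name == "td" and not tag.find("span")


def is_plain_time_cell(tag):
    # Assumed extension: additionally skip the footnote cell via its colspan.
    return is_td_without_span(tag) and not tag.has_attr("colspan")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

without_span = [
    td.text.strip()
    for td in soup.find_all(is_td_without_span, {"class": "ng-binding"})
]
plain_cells = [
    td.text.strip()
    for td in soup.find_all(is_plain_time_cell, {"class": "ng-binding"})
]

print(without_span)  # still includes the footnote text
print(plain_cells)   # only the time cells of this sample
```

Whether the colspan check is enough depends on the full table; the output quoted in the comments under Solution 1 suggests the real rows hold further span-free cells (airport, aircraft type) that would need their own condition.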

Use a regular expression to filter out td tags whose text does not match:

import re
# Hour can omit the leading 0, minutes cannot.
# (?:AM|PM) keeps the alternation scoped to the suffix instead of the whole pattern.
regex = r"\d{1,2}:\d{2} (?:AM|PM)"
def is_td_with_time(tag):
    return tag.name == "td" and re.search(regex, tag.text) is not None

times = soup.find_all(is_td_with_time,{'class':'ng-binding'})
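
One detail worth checking in a pattern like this is operator precedence: `|` binds more loosely than concatenation, so `\d{1,2}:\d{2} AM|PM` reads as "a time followed by AM" or "the bare text PM". The sketch below compares an ungrouped and a grouped variant using only the standard `re` module; the sample strings, including "PM departures", are made up for illustration.

```python
import re

# Ungrouped: parsed as (\d{1,2}:\d{2} AM) | (PM)
UNGROUPED = re.compile(r"\d{1,2}:\d{2} AM|PM")
# Grouped: the alternation applies to the AM/PM suffix only
GROUPED = re.compile(r"\d{1,2}:\d{2} (?:AM|PM)")

# Illustrative strings; "PM departures" is a made-up non-time value.
samples = ["9:20 AM", "1:05 PM", "Scheduled", "PM departures"]

ungrouped_hits = [s for s in samples if UNGROUPED.search(s)]
grouped_hits = [s for s in samples if GROUPED.search(s)]

print(ungrouped_hits)  # also accepts the string that merely contains "PM"
print(grouped_hits)    # accepts only the two time strings
```

Because `re.search` matches anywhere in the text, either variant still accepts a cell that contains a time alongside other text; `re.fullmatch` on the stripped text would be a stricter option if that matters.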

【Comments】:

【Solution 3】:

Here is a solution using htql:

>>> import htql
>>> results = htql.query(html, "<td (tx =~ '\\d.*')>:tx ")
>>> results
[('9:20 AM',), ('1:05 PM',), ('1:15 PM',), ('9:20 AM',)]

【Comments】: