BeautifulSoup - Isolate contents of a specific table (answers)

【Question title】: BeautifulSoup - Isolate contents of a specific table
【Posted】: 2021-12-24 10:28:41
【Question description】:

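I am new to data scraping with Beautiful Soup. I want to pull the statistics from this pro-football-reference page: https://www.pro-football-reference.com/boxscores/201009090nor.htm#all_pbp

I want to iterate over every row under the "Detail" column of the Full Play-By-Play table, so that I can save the row whenever the detail contains the word "Penalty". Does anyone know how I might do this? This table seems to behave differently from the others.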

# An example of how I extracted another element (Referee Name)
# from the same page but a different table.
# `soup` is the already-parsed page.
from bs4 import BeautifulSoup, Comment

table = soup.select_one('#all_officials').find_next(text=lambda t: isinstance(t, Comment))
table = BeautifulSoup(table, 'html.parser')
for tr in table.select('tr'):
    tds = [td.get_text(strip=True) for td in tr.select('td')]
    # str(*tds) assumes each row has at most one <td>
    if str(*tds) != "Officials":
        referee = str(*tds)
        break

【Question discussion】:

Tags: python beautifulsoup

【Solution 1】:

The table is commented out in the page source. A common and robust approach is to import Comment and loop with for comment in soup.find_all(text=lambda text: isinstance(text, Comment)), as shown here: https://stackoverflow.com/a/60381103.
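A minimal sketch of that Comment-based approach, run against a small made-up HTML string rather than the real page. The helper name commented_tables and the sample rows are illustrative assumptions; string= is the newer spelling of the text= argument used above.

```python
from bs4 import BeautifulSoup, Comment

# Made-up HTML standing in for the real page: the table sits inside an HTML comment.
html = """
<div id="all_pbp">
  <!--
  <table id="pbp">
    <tr><td data-stat="detail">Penalty on offense: False Start, 5 yards</td></tr>
    <tr><td data-stat="detail">Pass complete for 9 yards</td></tr>
  </table>
  -->
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def commented_tables(soup):
    """Parse every HTML comment and return the <table> tags found inside (hypothetical helper)."""
    tables = []
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        inner = BeautifulSoup(comment, "html.parser")
        tables.extend(inner.find_all("table"))
    return tables


tables = commented_tables(soup)
details = [td.get_text(strip=True) for td in tables[0].select('td[data-stat="detail"]')]
```

Each comment is re-parsed on its own, so the rest of the page is left untouched.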

For this particular example, I simply removed the comment markers with a string replacement.
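To illustrate why the replacement helps, here is a small sketch on a made-up one-line HTML string (not the real page): the selector finds nothing while the table is a comment, and finds it once the markers are gone.

```python
import re

from bs4 import BeautifulSoup

# Made-up HTML: the table exists only inside a comment.
html = '<div id="all_pbp"><!-- <table id="pbp"><tr><td>row</td></tr></table> --></div>'

# Before: the table is comment text, so a CSS selector cannot see it.
before = BeautifulSoup(html, "html.parser").select_one("#pbp")

# After: deleting the "<!--" and "-->" markers turns the comment body into real markup.
uncommented = re.sub(r"<!--|-->", "", html)
after = BeautifulSoup(uncommented, "html.parser").select_one("#pbp")
```

Note that this removes every comment marker in the document, so any other commented-out markup becomes live as well.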

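I then use :-soup-contains to target the appropriate rows, keeping only those rows of the table where the text Penalty appears inside an element whose data-stat attribute equals detail, i.e. the Detail column.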
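The selector can be tried offline on a small table. The rows below are invented for illustration, and the sketch assumes a soupsieve release recent enough to know :-soup-contains (older releases spelled it :contains).

```python
from bs4 import BeautifulSoup

# Made-up rows imitating the play-by-play layout; only data-stat="detail" matters here.
html = """
<table id="pbp">
  <tr><td data-stat="quarter">1</td><td data-stat="detail">Penalty on offense: False Start, 5 yards</td></tr>
  <tr><td data-stat="quarter">1</td><td data-stat="detail">Pass complete for 9 yards</td></tr>
  <tr><td data-stat="quarter">2</td><td data-stat="detail">Run for 3 yards. Penalty on defense: Offside (declined)</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Rows that contain a detail cell whose text includes "Penalty".
rows = soup.select('#pbp tr:has([data-stat=detail]:-soup-contains("Penalty"))')

# Collect just the detail text of the matching rows.
penalties = [row.select_one("[data-stat=detail]").get_text(strip=True) for row in rows]
```

If a plain list of penalty descriptions is all that is needed, penalties can be saved directly without going through pandas.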

Finally, I use pandas to rebuild a table from the HTML of the filtered tr elements, bookending it with table tags.

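from io import StringIO
import re

from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

r = requests.get('https://www.pro-football-reference.com/boxscores/201009090nor.htm#all_pbp',
                 headers={'User-Agent': 'Mozilla/5.0'})

# Strip the comment markers so the commented-out tables become regular HTML
s = re.sub(r'<!--|-->', '', r.text)
soup = bs(s, 'lxml')

# Keep only rows whose Detail cell mentions "Penalty", wrapped in <table> tags
s2 = '<table>' + ''.join(str(i) for i in soup.select(
    '#pbp tr:has([data-stat=detail]:-soup-contains("Penalty"))')) + '</table>'

# Recent pandas versions expect a file-like object rather than a literal HTML string
df = pd.read_html(StringIO(s2))[0]
df.columns = [i.text for i in soup.select('#pbp thead > tr > th')]
df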

【Discussion】: