Answers: Detecting header in HTML tables using beautifulsoup / lxml when table lacks thead element

【Question title】: Detecting header in HTML tables using beautifulsoup / lxml when table lacks thead element
【Posted】: 2017-12-30 17:06:53
【Question】:

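I want to detect the header of an HTML table when the table has no <thead> element. (MediaWiki, which drives Wikipedia, does not support <thead> elements.) I want to do this in Python with both BeautifulSoup and lxml. Suppose I already have a table object, and I want to get a thead object, a tbody object, and a tfoot object out of it.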

Currently, parse_thead does the following when a <thead> tag is present:

In BeautifulSoup, I get the table objects with doc.find_all('table'), and I can then use table.find_all('thead').
In lxml, I get the table objects with doc.xpath() using an xpath_expr of //table, and I can then use table.xpath('.//thead').
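
For reference, the two lookups described above might look like the following minimal sketch. The asker's parse_thead is not shown in the post, so the helper names thead_rows_bs and thead_rows_lxml and the two sample strings are hypothetical and only illustrate the calls mentioned.

```python
from bs4 import BeautifulSoup
import lxml.html

# Hypothetical sample inputs, not taken from the post.
HTML_WITH_THEAD = (
    "<table><thead><tr><th>Rank</th></tr></thead>"
    "<tbody><tr><td>1</td></tr></tbody></table>"
)
HTML_WITHOUT_THEAD = "<table><tr><th>Rank</th></tr><tr><td>1</td></tr></table>"


def thead_rows_bs(html):
    """Return the <tr> tags found inside <thead> of the first table (BeautifulSoup)."""
    doc = BeautifulSoup(html, "html.parser")
    table = doc.find_all("table")[0]
    return [tr for thead in table.find_all("thead") for tr in thead.find_all("tr")]


def thead_rows_lxml(html):
    """Return the <tr> elements found inside <thead> of the first table (lxml)."""
    doc = lxml.html.document_fromstring(html)
    table = doc.xpath("//table")[0]
    return table.xpath(".//thead//tr")


print(len(thead_rows_bs(HTML_WITH_THEAD)), len(thead_rows_lxml(HTML_WITH_THEAD)))
print(len(thead_rows_bs(HTML_WITHOUT_THEAD)), len(thead_rows_lxml(HTML_WITHOUT_THEAD)))
```

With the second input both helpers return an empty list, which is the situation the question is about.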

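parse_tbody and parse_tfoot work the same way. (I did not write this code, and I have no experience with either BS or lxml.) However, when there is no <thead>, parse_thead returns nothing and parse_tbody returns the header and the body together.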

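I am appending a wikitable instance below as an example. It lacks both <thead> and <tbody>. Instead, all rows, header or not, are enclosed in <tr>...</tr>, but header rows have <th> elements and body rows have <td> elements. Without a <thead>, the right criterion for identifying the header seems to be "starting from the top, put rows into the header until you find a row that contains an element that is not a <th>".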
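
The criterion quoted above can be sketched directly in BeautifulSoup. This is only an illustration under stated assumptions: the function name split_leading_header and the sample markup are hypothetical, rows of nested tables are skipped, and an empty <tr> is treated as ending the header.

```python
from bs4 import BeautifulSoup


def split_leading_header(table):
    """Return (head, body): head is the leading run of rows whose cells are all <th>."""
    head, body = [], []
    in_header = True
    for tr in table.find_all("tr"):
        if tr.find_parent("table") is not table:
            continue  # skip rows that belong to a nested table
        cells = tr.find_all(["th", "td"], recursive=False)
        # Stay in the header only while every row so far is non-empty and all <th>.
        in_header = in_header and bool(cells) and all(c.name == "th" for c in cells)
        (head if in_header else body).append(tr)
    return head, body


# Made-up table: two header rows, one body row, then a late all-<th> row.
html = (
    "<table>"
    "<tr><th>A</th><th>B</th></tr>"
    "<tr><th>a1</th><th>b1</th></tr>"
    "<tr><td>1</td><td>2</td></tr>"
    "<tr><th>x</th><th>y</th></tr>"
    "</table>"
)
table = BeautifulSoup(html, "html.parser").find("table")
head, body = split_leading_header(table)
print(len(head), len(body))  # 2 2
```

Because the header ends at the first row that is not all <th>, the late all-<th> row is counted as body, and a table with no header rows yields an empty head list.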

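I would greatly appreciate advice on how to write parse_thead and parse_tbody. Without much experience, I think I could either

dive into the table object and manually insert thead and tbody tags before parsing (this seems nice because then I would not have to change any of the other code that recognizes tables with a <thead>), or alternatively
change parse_thead and parse_tbody to recognize table rows that have only <th> elements. (With either approach, it seems I really do need to detect the head-body boundary in this way.)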
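
As an illustration of the first option, the sketch below rewrites a BeautifulSoup table in place so that later code can look for <thead> and <tbody> as usual. The function name add_thead_tbody is hypothetical, and the sketch assumes the table has no <tfoot> and that the header is the leading run of all-<th> rows.

```python
from bs4 import BeautifulSoup


def add_thead_tbody(soup, table):
    """Move leading all-<th> rows into a new <thead>, the rest into a new <tbody>."""
    if table.find("thead", recursive=False) is not None:
        return  # the table already has an explicit header section
    rows = [tr for tr in table.find_all("tr") if tr.find_parent("table") is table]
    thead, tbody = soup.new_tag("thead"), soup.new_tag("tbody")
    in_header = True
    for tr in rows:
        cells = tr.find_all(["th", "td"], recursive=False)
        in_header = in_header and bool(cells) and all(c.name == "th" for c in cells)
        # append() detaches the row from its old position before re-attaching it.
        (thead if in_header else tbody).append(tr)
    for old in table.find_all("tbody", recursive=False):
        old.decompose()  # any pre-existing <tbody> is empty after the move
    # Both sections are always added; <thead> is simply empty if no header row exists.
    table.append(thead)
    table.append(tbody)


html = "<table><tr><th>Rank</th></tr><tr><td>1</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
add_thead_tbody(soup, table)
print(table)
```

After the call, table.find_all('thead') returns the new section, so code written for tables that already have a <thead> should not need to change.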

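I don't know how to do either of these things, and I would appreciate advice both on which option is more sensible and on how to go about it.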

(Edit: examples with no header rows and with multiple header rows. I cannot assume there is exactly one header row.)

<table class="wikitable">
<tr>
<th>Rank</th>
<th>Score</th>
<th>Overs</th>
<th><b>Ext</b></th>
<th>b</th>
<th>lb</th>
<th>w</th>
<th>nb</th>
<th>Opposition</th>
<th>Ground</th>
<th>Match Date</th>
</tr>
<tr>
<td>1</td>
<td>437</td>
<td>136.0</td>
<td><b>64</b></td>
<td>18</td>
<td>11</td>
<td>1</td>
<td>34</td>
<td>v West Indies</td>
<td>Manchester</td>
<td>27 Jul 1995</td>
</tr>
</table>

【Question comments】:

What is the header of this table?
The first <tr> marks the header, because that row contains only <th>s. That is how MediaWiki formats it. en.wikipedia.org/wiki/…

Tags: python beautifulsoup lxml

【Solution 1】:

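We can use the <th> tags to detect the header when the table does not contain a <thead> tag. If every cell of a row is a <th> tag, we can assume the row is a header row. Based on this, I created a function that identifies the header and the body.

Code for BeautifulSoup: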

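def parse_table(table):
    """Split the rows of a bs4 table Tag into header rows and body rows."""
    head_body = {'head': [], 'body': []}
    for tr in table.select('tr'):
        # A row counts as a header row when every direct child tag is a <th>.
        if all(t.name == 'th' for t in tr.find_all(recursive=False)):
            head_body['head'].append(tr)
        else:
            head_body['body'].append(tr)
    return head_body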

Code for lxml:

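def parse_table(table):
    """Split the rows of an lxml.html table element into header and body rows."""
    head_body = {'head': [], 'body': []}
    # cssselect() is available on lxml.html elements and needs the cssselect package.
    for tr in table.cssselect('tr'):
        # Iterating an element yields its direct children (getchildren() is deprecated).
        if all(t.tag == 'th' for t in tr):
            head_body['head'].append(tr)
        else:
            head_body['body'].append(tr)
    return head_body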

The table argument is either a Beautiful Soup Tag object or an lxml Element object. head_body is a dictionary containing two lists of <tr> tags: the header rows and the body rows.

Usage example:

from bs4 import BeautifulSoup

html = '<table><tr><th>header</th></tr><tr><td>body</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')
table_rows = parse_table(table)

print(table_rows)
# {'head': [<tr><th>header</th></tr>], 'body': [<tr><td>body</td></tr>]}
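
The lxml variant can be exercised in much the same way. The sketch below is a self-contained adaptation, not the answer's exact code: it swaps cssselect('tr') for xpath('.//tr') so that the separate cssselect package is not required, and it prints each row's text rather than the element objects.

```python
import lxml.html


def parse_table(table):
    """Split the rows of an lxml table element into header rows and body rows."""
    head_body = {'head': [], 'body': []}
    for tr in table.xpath('.//tr'):
        # Iterating an element yields its direct children.
        if all(cell.tag == 'th' for cell in tr):
            head_body['head'].append(tr)
        else:
            head_body['body'].append(tr)
    return head_body


html = '<table><tr><th>header</th></tr><tr><td>body</td></tr></table>'
table = lxml.html.fromstring(html)  # a single top-level element is returned directly
table_rows = parse_table(table)

head_text = [tr.text_content() for tr in table_rows['head']]
body_text = [tr.text_content() for tr in table_rows['body']]
print(head_text, body_text)  # ['header'] ['body']
```

Note that both variants classify each row independently, so an all-<th> row that appears after body rows also lands in 'head'.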

【Comments】:

【Solution 2】:

You should check whether the tr tag contains the th child you want. If the candidate has no th inside it, candidate.th returns None:

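# soup is assumed to be an existing BeautifulSoup document.
possible_headers = soup.find("table").find_all("tr")

headers = []
for candidate in possible_headers:
    # candidate.th is the first <th> found anywhere inside the row, or None.
    if candidate.th:
        headers.append(candidate)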

【Comments】:

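Unfortunately, there may be more than one header row. There are certainly plenty of Wikipedia tables with multiple header rows. (And tables with no header row at all.)
Oh, I see. Can you provide an example page, then?
Here is one with two header rows: en.wikipedia.org/wiki/…
Here is one with no header row: en.wikipedia.org/wiki/…
OK, my solution works for tables with multiple header rows, but it certainly does not work for the table with no header row. Note that even though that table has no proper header, its first column is wrapped in <th>, so the code treats those rows as headers.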
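
The difference discussed in these comments can be reproduced with a small made-up table that has row headers in its first column but no header row. The markup and variable names below are hypothetical; the sketch only contrasts the "any <th>" test from this answer with the "all cells are <th>" test.

```python
from bs4 import BeautifulSoup

# Made-up table: every row starts with a <th> row header, and there is no header row.
html = (
    "<table>"
    "<tr><th>Row A</th><td>1</td></tr>"
    "<tr><th>Row B</th><td>2</td></tr>"
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

# "Any <th>" test: matches every row that contains a <th> somewhere.
any_th = [tr for tr in rows if tr.th]

# "All cells are <th>" test: matches only rows made up entirely of <th> cells.
all_th = [
    tr for tr in rows
    if all(c.name == "th" for c in tr.find_all(["th", "td"], recursive=False))
]

print(len(any_th), len(all_th))  # 2 0
```

For a table like this one, the "any <th>" test reports both rows as headers, while the stricter test reports none.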