How do you get all the rows from a particular table using BeautifulSoup? Answers

【Question title】: How do you get all the rows from a particular table using BeautifulSoup?
【Posted】: 2011-01-01 21:37:41
【Question description】:

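I am learning Python and BeautifulSoup to scrape data from the web and read an HTML table. I can read the page into Open Office, which reports the table I want as Table #11.

BeautifulSoup seems to be the preferred choice, but can anyone tell me how to grab a particular table and all of its rows? I have looked at the module documentation, but I can't get my head around it. Many of the examples I have found online seem to do more than I need.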

【Question comments】:

Tags: python beautifulsoup

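【Solution 1】:

This should be pretty straightforward if you have a chunk of HTML to parse with BeautifulSoup. The general idea is to navigate to your table using the findChildren method; you can then get the text value inside each cell with the string property.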

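>>> # bs4 is the maintained successor of the old BeautifulSoup 3 package
>>> from bs4 import BeautifulSoup
>>> 
>>> html = """
... <html>
... <body>
...     <table>
...         <th><td>column 1</td><td>column 2</td></th>
...         <tr><td>value 1</td><td>value 2</td></tr>
...     </table>
... </body>
... </html>
... """
>>>
>>> # Name the parser explicitly so the result does not depend on what is installed
>>> soup = BeautifulSoup(html, 'html.parser')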
>>> tables = soup.findChildren('table')
>>>
>>> # This will get the first (and only) table. Your page may have more.
>>> my_table = tables[0]
>>>
>>> # You can find children with multiple tags by passing a list of strings
>>> rows = my_table.findChildren(['th', 'tr'])
>>>
>>> for row in rows:
...     cells = row.findChildren('td')
...     for cell in cells:
...         value = cell.string
...         print("The value in this cell is %s" % value)
... 
The value in this cell is column 1
The value in this cell is column 2
The value in this cell is value 1
The value in this cell is value 2
>>>
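
The question asks about one specific table ("Table #11"), so here is a minimal sketch of picking a table by position with the `bs4` package. The sample HTML and the helper name `rows_of_table` are made up for illustration, and it is only an assumption that Open Office numbers tables in the same order in which they appear in the document.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; the second one is the table we want.
html = """
<html><body>
<table><tr><td>skip me</td></tr></table>
<table>
  <tr><th>column 1</th><th>column 2</th></tr>
  <tr><td>value 1</td><td>value 2</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")


def rows_of_table(soup, table_number):
    """Return the rows of the table_number-th table (1-based) as lists of cell text."""
    tables = soup.find_all("table")          # tables in document order
    table = tables[table_number - 1]         # convert the 1-based number to an index
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["th", "td"])    # header and data cells alike
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


for row in rows_of_table(soup, 2):
    print(row)
```

Under that numbering assumption, `rows_of_table(soup, 11)` would be the call for the table in the question. Note that this sketch searches recursively, so it is only suitable for pages without nested tables.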

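【Comments】:

That did the trick! The code works, and I should be able to adapt it as needed. Many thanks. One last question: I can follow the code except for the part that searches the table for the children th and tr. Is that simply searching my table and returning both the table headers and the table rows? If I only want the table rows, can I search for just tr? Thanks again!
Yes, .findChildren(['th', 'tr']) searches for elements whose tag type is th or tr. If you only want tr elements, use .findChildren('tr') (note: a plain string, not a list).
It's also worth noting that PyQuery is a really nice alternative to BeautifulSoup.
th is a header cell. This is a malformed table.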
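
As a follow-up to the comments above, here is a small sketch of the difference between passing a string and passing a list, using a well-formed table in which the `th` header cells sit inside a `tr`. The sample HTML is invented for illustration, and `find_all` is used because `bs4` keeps `findChildren` only as an older alias of it.

```python
from bs4 import BeautifulSoup

# Hypothetical well-formed table: header cells (<th>) live inside a row (<tr>).
html = """
<table>
  <tr><th>column 1</th><th>column 2</th></tr>
  <tr><td>value 1</td><td>value 2</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

# A single string matches one tag name; a list matches any of the names.
only_rows = table.find_all("tr")
rows_and_headers = table.find_all(["th", "tr"])

print([tag.name for tag in only_rows])
print([tag.name for tag in rows_and_headers])

# With a well-formed table, searching for 'tr' alone is enough:
# the first row holds the headers and the remaining rows hold the data.
header = [th.get_text(strip=True) for th in only_rows[0].find_all("th")]
data = [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in only_rows[1:]]
print(header)
print(data)
```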

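【Solution 2】:

If you ever have nested tables (as on old-school designed websites), the approach above might fail.

As a solution, you might want to extract the non-nested tables first: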

html = '''<table>
<tr>
<td>Top level table cell</td>
<td>
    <table>
    <tr><td>Nested table cell</td></tr>
    <tr><td>...another nested cell</td></tr>
    </table>
</td>
</tr>
</table>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
non_nested_tables = [t for t in soup.find_all('table') if not t.find_all('table')]
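
To make the effect of that filter concrete, here is a short sketch that prints what ends up in `non_nested_tables`. It reuses the HTML above but swaps in the standard-library `html.parser` (an assumption made so the example has no extra dependency); the variable `leaf_rows` is introduced only for illustration.

```python
from bs4 import BeautifulSoup

html = """<table>
<tr>
<td>Top level table cell</td>
<td>
    <table>
    <tr><td>Nested table cell</td></tr>
    <tr><td>...another nested cell</td></tr>
    </table>
</td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# Keep only tables that do not contain another table.
non_nested_tables = [t for t in soup.find_all("table") if not t.find_all("table")]

leaf_rows = []
for table in non_nested_tables:
    for row in table.find_all("tr"):
        leaf_rows.append([cell.get_text(strip=True) for cell in row.find_all(["th", "td"])])

print(len(non_nested_tables))
for row in leaf_rows:
    print(row)
```

Note that the filter keeps the innermost table and drops the outer one, because the outer table is the one that contains another table. Cells that belong only to the outer table, such as "Top level table cell", are therefore not returned by this approach.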

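Alternatively, if you want to extract the content of all tables, including tables that nest other tables, you can extract only the top-level tr rows and their th/td cells. To do this, turn off recursion when calling the find_all method: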

soup = BeautifulSoup(html, 'lxml')
tables = soup.find_all('table')
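for cnt, my_table in enumerate(tables, start=1):
    print('=============== TABLE {} ==============='.format(cnt))
    # recursive=False limits the search to direct children, so rows of nested tables are skipped
    rows = my_table.find_all('tr', recursive=False)
    for row in rows:
        # Same here: only the cells that belong directly to this row
        cells = row.find_all(['th', 'td'], recursive=False)
        for cell in cells:
            # DO SOMETHING
            # cell.string is None for a cell with several children, e.g. one holding a nested table
            if cell.string:
                print(cell.string)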

Output:

=============== TABLE 1 ===============
Top level table cell
=============== TABLE 2 ===============
Nested table cell
...another nested cell

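【Comments】:

【Solution 3】:

Recursive search is a nice trick if you don't have nested tables, but if you do, you need to work one level at a time.

One HTML variation that can bite you is the following, which also uses tbody and/or thead elements.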

html = '''
    <table class="fancy">
        <thead>
           <tr><th>Nested table cell</th></tr>
        </thead>
        <tbody>
            <tr><td><table id=2>...another nested cell</table></td></tr>
        </tbody>
    </table>
'''

In this case, you need to do the following:

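from bs4 import BeautifulSoup

# The parser choice is an assumption; the original answer did not show this step
soup = BeautifulSoup(html, 'html.parser')

# Take the outer table, then step down one level at a time
table = soup.find_all("table", {"class": "fancy"})[0]
thead = table.find_all('thead', recursive=False)
header = thead[0].find_all('th')

tbody = table.find_all('tbody', recursive=False)
rows = tbody[0].find_all('tr', recursive=False)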

Now you have the header and the rows.
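
The "one level at a time" idea can be wrapped in a small helper that works whether or not the markup uses thead/tbody/tfoot. This is only a sketch: the names `direct_rows` and `own_text` and the sample HTML are hypothetical, and `html.parser` is assumed. Parsers differ here; html5lib, for example, inserts a tbody even when the source has none.

```python
from bs4 import BeautifulSoup

# Hypothetical table with thead/tbody and a table nested inside a cell.
html = """
<table class="fancy">
  <thead>
    <tr><th>Header cell</th></tr>
  </thead>
  <tbody>
    <tr><td>Outer cell<table><tr><td>Inner cell</td></tr></table></td></tr>
  </tbody>
</table>
"""


def direct_rows(table):
    """Return the <tr> elements that belong to this table, not to a nested one."""
    rows = []
    # Look only at direct children: row groups (thead/tbody/tfoot) or bare rows.
    for child in table.find_all(["thead", "tbody", "tfoot", "tr"], recursive=False):
        if child.name == "tr":
            rows.append(child)
        else:
            rows.extend(child.find_all("tr", recursive=False))
    return rows


def own_text(cell):
    """Return the text held directly by the cell, ignoring nested elements."""
    return "".join(cell.find_all(string=True, recursive=False)).strip()


soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="fancy")

result = [
    [own_text(cell) for cell in row.find_all(["th", "td"], recursive=False)]
    for row in direct_rows(table)
]
print(result)
```

Collecting the row groups and bare rows in a single non-recursive search keeps the rows in document order, and the nested table can be processed separately by calling `direct_rows` on it.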

【Comments】: