How to scrape tables using Beautiful Soup? (Answers)

【Question title】: How to scrape tables using Beautiful Soup?
【Posted】: 2021-12-08 18:57:08
【Question description】:

I tried to scrape a table by following the question: Python BeautifulSoup scrape tables

Based on the top answer there, I tried the following.

HTML code:

<div class="table-frame small">
    <table id="rfq-display-line-items-list" class="table">
        <thead id="rfq-display-line-items-header">
          <tr>
          <th>Mfr. Part/Item #</th>
          <th>Manufacturer</th>
          <th>Product/Service Name</th>
          <th>Qty.</th>
          <th>Unit</th>
          <th>Ship Address</th>
        </tr>
      </thead>
      <tbody id="rfq-display-line-item-0">

        <tr>
            <td><span class="small">43933</span></td>
            <td><span class="small">Anvil International</span></td>
            <td><span class="small">Cap Steel Black 1-1/2"</span></td>
            <td><span class="small">800</span></td>
            <td><span class="small">EA</span></td>
            <td><span class="small">1</span></td>
        </tr>
      <!----><!---->
      </tbody><tbody id="rfq-display-line-item-1">

        <tr>
            <td><span class="small">330035205</span></td>
            <td><span class="small">Anvil International</span></td>
            <td><span class="small">1-1/2" x 8" Black Steel Nipple</span></td>
            <td><span class="small">400</span></td>
            <td><span class="small">EA</span></td>
            <td><span class="small">1</span></td>
        </tr>
      <!----><!---->
      </tbody><!---->
    </table><!---->
</div>

Following that solution, this is what I tried:

for tr in soup.find_all('table', {'id': 'rfq-display-line-items-list'}):
    tds = tr.find_all('td')
    print(tds[0].text, tds[1].text, tds[2].text, tds[3].text, tds[4].text, tds[5].text)

But this prints only the first row:

43933 Anvil International Cap Steel Black 1-1/2" 800 EA 1

I later found that all the <td> elements end up in a single list. I want to print every row.

Expected output:

43933      Anvil International Cap Steel Black 1-1/2" 800 EA 1
330035205  Anvil International 1-1/2" x 8" Black Steel Nipple 400 EA 1

【Question comments】:

Tags: python python-3.x web-scraping beautifulsoup web-crawler

【Solution 1】:

Iterate over the <tr> tags first, then over the <td> cells inside each row.

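from bs4 import BeautifulSoup

# html is assumed to hold the markup shown in the question
soup = BeautifulSoup(html, "html.parser")

# find() returns the single table; find_all("tr") then yields every row in it
for tr in soup.find("table", id="rfq-display-line-items-list").find_all("tr"):
    # the header row contains only <th> cells, so it prints an empty line
    print(" ".join([td.text for td in tr.find_all('td')]))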

43933 Anvil International Cap Steel Black 1-1/2" 800 EA 1
330035205 Anvil International 1-1/2" x 8" Black Steel Nipple 400 EA 1

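【Comments】:

I also need the rows to be organized.
Could you include the expected output in your post?
Expected output added.
The values have to be organized because I need to store them in a list.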
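For the list storage mentioned in the comments above, one possible approach is to collect each row's cells into its own list. The following is only a minimal sketch, not the answerer's code: the `html` string is a trimmed, three-column copy of the table from the question, and the variable names `rows` and `cells` are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a trimmed copy of the question's table (three of the six columns).
html = """
<table id="rfq-display-line-items-list" class="table">
  <thead><tr><th>Mfr. Part/Item #</th><th>Manufacturer</th><th>Qty.</th></tr></thead>
  <tbody id="rfq-display-line-item-0">
    <tr><td><span class="small">43933</span></td><td><span class="small">Anvil International</span></td><td><span class="small">800</span></td></tr>
  </tbody>
  <tbody id="rfq-display-line-item-1">
    <tr><td><span class="small">330035205</span></td><td><span class="small">Anvil International</span></td><td><span class="small">400</span></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="rfq-display-line-items-list")

rows = []
for tr in table.find_all("tr"):
    # One list of cell texts per row, with surrounding whitespace removed.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so its list is empty
        rows.append(cells)

print(rows)
# [['43933', 'Anvil International', '800'], ['330035205', 'Anvil International', '400']]
```

Each inner list is one table row, so `rows[1][0]` gives the part number of the second item.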

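【Solution 2】:

What happens?

When you select the table with find_all(), you get a result set containing just one element (the table), so your loop runs only once. Inside that single iteration, tds holds every cell of the whole table, and indexes 0 to 5 cover only the first row, which is why only the first row is printed.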
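The behaviour described above can be checked by counting what each call returns. This is a small diagnostic sketch under the assumption that `html` is a trimmed, three-column copy of the question's table; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Assumption: a trimmed copy of the question's table (three of the six columns).
html = """
<table id="rfq-display-line-items-list" class="table">
  <thead><tr><th>Mfr. Part/Item #</th><th>Manufacturer</th><th>Qty.</th></tr></thead>
  <tbody id="rfq-display-line-item-0">
    <tr><td><span class="small">43933</span></td><td><span class="small">Anvil International</span></td><td><span class="small">800</span></td></tr>
  </tbody>
  <tbody id="rfq-display-line-item-1">
    <tr><td><span class="small">330035205</span></td><td><span class="small">Anvil International</span></td><td><span class="small">400</span></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() on the table id matches exactly one element: the table itself.
tables = soup.find_all("table", {"id": "rfq-display-line-items-list"})
print(len(tables))  # 1

# Searching that one table for <td> returns the cells of all rows, flattened.
all_cells = tables[0].find_all("td")
print(len(all_cells))  # 6

# Searching for <tr> instead gives one element per row (header included).
rows = tables[0].find_all("tr")
print(len(rows))  # 3
```

Looping over `rows` rather than over `tables` is what gives one iteration per table row.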

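How to fix it?

Select your target more specifically. As an alternative, you can also use CSS selectors and stripped_strings to accomplish the task.

This selects every <tr> inside a <tbody> of the element (the table) with id="rfq-display-line-items-list":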

soup.select('#rfq-display-line-items-list tbody tr')

stripped_strings is a generator that yields the whitespace-stripped strings of all elements (the <td>s) in row, which you can join() into a single string:

" ".join(list(row.stripped_strings))

Example

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

for row in soup.select('#rfq-display-line-items-list tbody tr'):
    print(" ".join(list(row.stripped_strings)))

Output

43933 Anvil International Cap Steel Black 1-1/2" 800 EA 1
330035205 Anvil International 1-1/2" x 8" Black Steel Nipple 400 EA 1
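If the values need to stay organized by column rather than joined into one string, the same CSS selectors can be combined with the header cells. The sketch below is an illustration under assumptions, not part of the original answer: `html` is a trimmed, three-column copy of the question's table, and `headers` and `records` are names introduced here.

```python
from bs4 import BeautifulSoup

# Assumption: a trimmed copy of the question's table (three of the six columns).
html = """
<table id="rfq-display-line-items-list" class="table">
  <thead><tr><th>Mfr. Part/Item #</th><th>Manufacturer</th><th>Qty.</th></tr></thead>
  <tbody id="rfq-display-line-item-0">
    <tr><td><span class="small">43933</span></td><td><span class="small">Anvil International</span></td><td><span class="small">800</span></td></tr>
  </tbody>
  <tbody id="rfq-display-line-item-1">
    <tr><td><span class="small">330035205</span></td><td><span class="small">Anvil International</span></td><td><span class="small">400</span></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Column names come from the <th> cells in the table head.
headers = [th.get_text(strip=True)
           for th in soup.select("#rfq-display-line-items-list thead th")]

records = []
for row in soup.select("#rfq-display-line-items-list tbody tr"):
    # Read each <td> separately so an empty cell still occupies its column.
    values = [td.get_text(strip=True) for td in row.select("td")]
    records.append(dict(zip(headers, values)))

print(records[0])
# {'Mfr. Part/Item #': '43933', 'Manufacturer': 'Anvil International', 'Qty.': '800'}
```

Reading cell by cell instead of using stripped_strings on the whole row is a deliberate choice here: stripped_strings skips empty cells, which would shift later values into the wrong column.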

【Comments】: