How do I get Beautifulsoup to parse a serial HTML list in a table into a CSV pattern of data?

【Question title】: How do I get Beautifulsoup to Parse a Serial HTML list in a table into a CSV pattern of data?
【Posted】: 2019-10-28 07:37:30
【Question description】:

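I have a company-internal web page that lists various data in one long list, and I want to convert that data into a CSV file for review. The data is laid out like this: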

*CUSTOMER_1*
Email Link   Category_Text    Phone_Numbers
Email Link   Category_Text    Phone_Numbers
*Customer_2*
Email Link   Category_Text    Phone_Numbers
Email Link   Category_Text    Phone_Numbers

In HTML it is encoded like this:

<table id="responsibility">
    <tr class="customer">
        <td colspan="6">
            <strong>CUSTOMER 1</strong>
        </td>
    </tr>
    <tr id="tr_1" title="Role_Name1">
        <td><a href="email@company.com1">Name_1</a></td>
        <td>Category_Text</td>
        <td>Phone_Numbers</td>
        <td></td>
    </tr>
    <tr id="tr_2" title="Role_Name2">
        <td><a href="email@company.com2">Name_2</a></td>
        <td>Category_Text</td>
        <td>Phone_Numbers</td>
        <td></td>
    </tr>
    <tr class="customer">
        <td colspan="6">
            <strong>CUSTOMER 2</strong>
        </td>
    </tr>
    <tr id="tr_1" title="Role_Name1">
        <td><a href="email@company.com3">Name_3</a></td>
        <td>Category_Text</td>
        <td>Phone_Numbers</td>
        <td></td>
    </tr>
    <tr id="tr_2" title="Role_Name2">
        <td><a href="email@company.com2">Name_2</a></td>
        <td>Category_Text</td>
        <td>Phone_Numbers</td>
        <td></td>
    </tr>
</table>

I want to end up with a file.csv containing the information laid out like this:

   CUSTOMER1,Role_Name1,Name_1,Email_1,Category_Text,Phone_Numbers
   CUSTOMER1,Role_Name2,Name_2,Email_2,Category_Text,Phone_Numbers
   CUSTOMER2,Role_Name1,Name_3,Email_3,Category_Text,Phone_Numbers
   CUSTOMER2,Role_Name2,Name_2,Email_2,Category_Text,Phone_Numbers

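Right now I can get a list of all customer names, or a list of all the text, but I can't figure out how to iterate over each customer and then over each of that customer's rows.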

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("source.html"), "html.parser")

with open("output.csv",'w') as file:
    responsibility=soup.find('table',{'id':'responsibility'})
    line=responsibility.tr
    for i in responsibility:
        print(line)
        line=responsibility.tr.next_sibling

I expected this to print every tag in the document, but it only prints the first tag and never moves on to the next one.

【Question discussion】:

Tags: python web-scraping html-table beautifulsoup

【Solution 1】:

Look at this line of code:

line=responsibility.tr

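Here you are using the .tr attribute, which locates the first <tr> tag in the table and returns that single tag.

What does that mean? Suppose the table has n <tr> tags; .tr gives you only the first of those n. So if you want all n, use find_all(), which returns a list of every match.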

line=responsibility.find_all("tr", class_="customer")

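Also note the class_="customer" filter. It restricts the search to the <tr> rows with class "customer". From each of those rows you can then step forward to the two rows that follow, the ones with title="Role_Name*" attributes. Be aware that when the markup has whitespace between rows, .next_sibling returns that whitespace text node rather than the next <tr>, so find_next_sibling("tr") is the more reliable way to step forward.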
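To see the difference between the two ways of stepping to the next row, here is a minimal sketch. The inline html string is a hypothetical, shortened stand-in for source.html, and it assumes the html.parser backend used in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical inline snippet standing in for source.html.
html = """
<table id="responsibility">
    <tr class="customer"><td colspan="6"><strong>CUSTOMER 1</strong></td></tr>
    <tr id="tr_1" title="Role_Name1"><td>Category_Text</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
customer_row = soup.find("tr", class_="customer")

# The newline and indentation after </tr> form a text node,
# so the immediate sibling is a string, not a tag.
raw_sibling = customer_row.next_sibling

# find_next_sibling("tr") skips text nodes and returns the next <tr> tag.
next_row = customer_row.find_next_sibling("tr")

print(repr(raw_sibling))
print(next_row["title"])
```

If the page were minified, with no whitespace between rows, .next_sibling would land on the next <tr> directly, which is why code relying on it can work on one page and fail on another.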

So, to put the above into practice:

from bs4 import BeautifulSoup

# Parse the saved page; the with-block closes the file handle afterwards.
with open("source.html") as source:
    soup = BeautifulSoup(source, "html.parser")

responsibility = soup.find("table", {"id": "responsibility"})
lines = responsibility.find_all("tr", class_="customer")
for line in lines:
    # find_next_sibling("tr") skips the whitespace text nodes between rows,
    # which a bare .next_sibling would return instead of the next <tr>.
    line1 = line.find_next_sibling("tr")   # locates tr with title="Role_Name1"
    line2 = line1.find_next_sibling("tr")  # locates tr with title="Role_Name2"
    print(line1)
    print(line2)
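The snippet above only prints the rows. One possible way to get from there to the CSV layout asked for in the question is sketched below. It is not part of the original answer: the function name table_to_rows and the SAMPLE_HTML string are made up for illustration, and the sketch assumes every non-customer row has at least three <td> cells as in the question's markup. Instead of hard-coding two rows per customer, it walks all <tr> rows in order and remembers the most recent customer header.

```python
import csv
import io

from bs4 import BeautifulSoup


def table_to_rows(html):
    """Return one [customer, role, name, email, category, phone] list per data row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": "responsibility"})
    rows = []
    customer = None
    for tr in table.find_all("tr"):
        if "customer" in tr.get("class", []):
            # Header row: remember the customer for the rows that follow.
            customer = tr.get_text(strip=True)
            continue
        cells = tr.find_all("td")
        link = cells[0].find("a")
        rows.append([
            customer,
            tr.get("title", ""),
            link.get_text(strip=True) if link else cells[0].get_text(strip=True),
            link.get("href", "") if link else "",
            cells[1].get_text(strip=True),
            cells[2].get_text(strip=True),
        ])
    return rows


# Hypothetical, shortened copy of the table from the question.
SAMPLE_HTML = """
<table id="responsibility">
    <tr class="customer"><td colspan="6"><strong>CUSTOMER 1</strong></td></tr>
    <tr id="tr_1" title="Role_Name1">
        <td><a href="email@company.com1">Name_1</a></td>
        <td>Category_Text</td>
        <td>Phone_Numbers</td>
        <td></td>
    </tr>
    <tr class="customer"><td colspan="6"><strong>CUSTOMER 2</strong></td></tr>
    <tr id="tr_1" title="Role_Name1">
        <td><a href="email@company.com3">Name_3</a></td>
        <td>Category_Text</td>
        <td>Phone_Numbers</td>
        <td></td>
    </tr>
</table>
"""

rows = table_to_rows(SAMPLE_HTML)

# Written to an in-memory buffer here; for a real file, use
# open("output.csv", "w", newline="") in place of io.StringIO().
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Tracking the current customer in a single pass means the number of rows per customer does not have to be known in advance, which the two-sibling approach above assumes.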

【Discussion】: