【Question title】:Creating an HTML table from a list in Python BeautifulSoup
【Posted】:2021-08-01 05:33:47
【Question】:

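I am using bs4 in Python. I want to take the contents of a Python list and insert them into HTML with bs4, so that the resulting HTML table can be published to a website link with requests.put(). In the HTML, each row is wrapped in the tag: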

<tr></tr>

Each cell, i.e. the single data element of a row in a given column, is represented by the tag:

<td></td>

So each data element goes inside a td tag, wrapped in a p tag, for example:

<tr><td><p>data 1 in cell 1</p></td><td><p>data 2 in cell 2</p></td></tr>

The data that should go into the HTML table is a list that looks like this:

rows = ["1" + "````" + "Mon, 22 Feb 2021 13:44:27 -0800" + "````" + "Jam" + "````" + "IAP-5998" + "````" + "10004" + "````" + "Model Observing a ModelIPCException" + "````" + "1ba4416fdd7", "2" + "````" + "Mon, 30 Feb 2021 13:44:27 -0800" + "````" + "Rizwan" + "````" + "IAP-6998" + "````" + "10014" + "````" + "Model Observing." + "````" + "3ba4416fdd7", "3" + "````" + "Fri, 20 Mar 2021 13:44:27 -0800" + "````" + "John" + "````" + "ATL-5998" + "````" + "10456" + "````" + "Exception during JumpToROM function call." + "````" + "8ca4416fdd7", "4" + "````" + "Mon, 14 Feb 2021 13:44:27 -0800" + "````" + "Brock Lesnar" + "````" + "IAP-6005" + "````" + "10009" + "````" + "RAM flushing JumpToROM function call." + "````" + "1ba4416fd10"]

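Each element of the list corresponds to one row, and the cells within a row are separated by "````". For example, 1 goes into the first cell and Jam into the third cell of the first row. The HTML table string should start with a table header and end with a table footer, as follows: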

html_table_header = "<p><br /></p><table><colgroup><col style=\"width: 115.0px;\" /><col style=\"width: 95.0px;\" /><col style=\"width: 58.0px;\" /><col style=\"width: 105.0px;\" /><col style=\"width: 110.0px;\" /><col style=\"width: 215.0px;\" /><col style=\"width: 215.0px;\" /></colgroup><tbody><tr><th><p>No.</p></th><th><p>Date and Time</p></th><th><p>Author</p></th><th><p>Jira</p></th><th><p>PR</p></th><th><p>Title</p></th><th><p>Commit ID</p></th></tr>"

html_table_footer = "</tbody></table><p class=\"auto-cursor-target\"><br /></p>"

So the complete HTML that makes up the table should look like this:

<p><br /></p><table><colgroup><col style=\"width: 115.0px;\" /><col style=\"width: 95.0px;\" /><col style=\"width: 58.0px;\" /><col style=\"width: 105.0px;\" /><col style=\"width: 110.0px;\" /><col style=\"width: 215.0px;\" /><col style=\"width: 215.0px;\" /></colgroup><tbody><tr><th><p>No.</p></th><th><p>Date and Time</p></th><th><p>Author</p></th><th><p>Jira</p></th><th><p>PR</p></th><th><p>Title</p></th><th><p>Commit ID</p></th></tr><tr><td><p>1</p></td><td><p>Mon, 22 Feb 2021 13:44:27 -0800</p></td><td><p>Jam</p></td><td><p>IAP-5998</p></td><td><p>10004</p></td><td><p>Model Observing a ModelIPCException</p></td><td><p>1ba4416fdd7</p></td></tr><tr><td><p>2</p></td><td><p>Mon, 30 Feb 2021 13:44:27 -0800</p></td><td><p>Rizwan</p></td><td><p>IAP-6998</p></td><td><p>10014</p></td><td><p>Model Observing</p></td><td><p>1ba4416fdd7</p></td></tr>....................................Other elements in list according to rows go here.............</tbody></table><p class=\"auto-cursor-target\"><br /></p>

This is the code I am using:

import re
import sys
import requests
import json
from requests.auth import HTTPBasicAuth
from bs4 import BeautifulSoup

html_table_header = "<p><br /></p><table><colgroup><col style=\"width: 115.0px;\" /><col style=\"width: 95.0px;\" /><col style=\"width: 58.0px;\" /><col style=\"width: 105.0px;\" /><col style=\"width: 110.0px;\" /><col style=\"width: 215.0px;\" /><col style=\"width: 215.0px;\" /></colgroup><tbody><tr><th><p>No.</p></th><th><p>Date and Time</p></th><th><p>Author</p></th><th><p>Jira</p></th><th><p>PR</p></th><th><p>Title</p></th><th><p>Commit ID</p></th></tr>"

html_table_footer = "</tbody></table><p class=\"auto-cursor-target\"><br /></p>"

rows = ["1" + "````" + "Mon, 22 Feb 2021 13:44:27 -0800" + "````" + "Jam" + "````" + "IAP-5998" + "````" + "10004" + "````" + "Model Observing a ModelIPCException" + "````" + "1ba4416fdd7", "2" + "````" + "Mon, 30 Feb 2021 13:44:27 -0800" + "````" + "Rizwan" + "````" + "IAP-6998" + "````" + "10014" + "````" + "Model Observing." + "````" + "3ba4416fdd7", "3" + "````" + "Fri, 20 Mar 2021 13:44:27 -0800" + "````" + "John" + "````" + "ATL-5998" + "````" + "10456" + "````" + "Exception during JumpToROM function call." + "````" + "8ca4416fdd7", "4" + "````" + "Mon, 14 Feb 2021 13:44:27 -0800" + "````" + "Brock Lesnar" + "````" + "IAP-6005" + "````" + "10009" + "````" + "RAM flushing JumpToROM function call." + "````" + "1ba4416fd10"]

row_string = ""
for idx in range(0, len(rows)):
    soup = BeautifulSoup("<tr></tr>", 'html.parser')
    for cell_id in range(0, 7):
        original_tag = soup.tr
        new_tag = soup.new_tag("td")
        original_tag.append(new_tag)
        p_tag = soup.new_tag("p")
        original_tag.td.next_sibling.append(p_tag)
        original_tag.p.string = rows[idx].split("````")[cell_id]
        row_string += str(original_tag)

pass_str = html_table_header + row_string + html_table_footer
pass_string = str(pass_str).replace('\"', '\\"')

headers = {
    'Content-Type': 'application/json',
}

data = '{"id":"534756378","type":"page", "title":"GL_Engine Output","space":{"key":"CSSAI"},"body":{"storage":{"value":"' + pass_string + '","representation":"storage"}}, "version":{"number":2}}'

response = requests.put('https://confluence.ai.com/rest/api/content/534756378', headers=headers, data=data,
                        auth=HTTPBasicAuth('svc-Automation@ai.com', 'AIengineering1@ai'))

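With my code, however, only the first field of each list element (the numbers 1, 2, 3, and so on) ends up in the correct cell; the remaining fields are all inserted into the first column. The table therefore looks wrong once it is published to the site: only the header row is correct, and everything else is squeezed together in the first column. I looked at the HTML that the REST API posted to my site, and it looks incorrect, as shown in the image below: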

【Comments】:

    Tags: python html beautifulsoup html-table python-requests


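    【Solution 1】:

    I think you can build the table HTML in a loop over the rows, using split and a list comprehension, and then use pandas to view the resulting table: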

    import pandas as pd
    from pandas import read_html as rh
    
    pd.set_option('display.expand_frame_repr', False)
    
    html_table_header = "<p><br /></p><table><colgroup><col style=\"width: 115.0px;\" /><col style=\"width: 95.0px;\" /><col style=\"width: 58.0px;\" /><col style=\"width: 105.0px;\" /><col style=\"width: 110.0px;\" /><col style=\"width: 215.0px;\" /><col style=\"width: 215.0px;\" /></colgroup><tbody><tr><th><p>No.</p></th><th><p>Date and Time</p></th><th><p>Author</p></th><th><p>Jira</p></th><th><p>PR</p></th><th><p>Title</p></th><th><p>Commit ID</p></th></tr>"
    
    html_table_footer = "</tbody></table><p class=\"auto-cursor-target\"><br /></p>"
    
    rows = ["1" + "````" + "Mon, 22 Feb 2021 13:44:27 -0800" + "````" + "Jam" + "````" + "IAP-5998" + "````" + "10004" + "````" + "Model Observing a ModelIPCException" + "````" + "1ba4416fdd7", "2" + "````" + "Mon, 30 Feb 2021 13:44:27 -0800" + "````" + "Rizwan" + "````" + "IAP-6998" + "````" + "10014" + "````" + "Model Observing." + "````" + "3ba4416fdd7", "3" + "````" + "Fri, 20 Mar 2021 13:44:27 -0800" + "````" + "John" + "````" + "ATL-5998" + "````" + "10456" + "````" + "Exception during JumpToROM function call." + "````" + "8ca4416fdd7", "4" + "````" + "Mon, 14 Feb 2021 13:44:27 -0800" + "````" + "Brock Lesnar" + "````" + "IAP-6005" + "````" + "10009" + "````" + "RAM flushing JumpToROM function call." + "````" + "1ba4416fd10"]
    body = ''
    
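    # Split each row on the separator and wrap every field in <td><p>...</p></td>.
    for row in rows:
        body += '<tr>' + ''.join([f'<td><p>{i}</p></td>' for i in row.split('````')]) + '</tr>'

    # Parse the assembled markup back into a DataFrame to check the column layout.
    html = html_table_header + body + html_table_footer
    print(rh(html)[0])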
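
The loop above inserts each field into the markup verbatim. If a field could ever contain characters such as `<` or `&`, a variant that escapes the text first may be safer. The following is only a sketch: `rows_to_html` and `SEPARATOR` are hypothetical names that do not appear in the original answer, and the sample row is made up for illustration.

```python
from html import escape

SEPARATOR = "````"  # cell delimiter used by the rows in the question


def rows_to_html(rows):
    """Return one <tr> per row, with every field escaped and wrapped in <td><p>."""
    return "".join(
        "<tr>"
        + "".join(f"<td><p>{escape(cell)}</p></td>" for cell in row.split(SEPARATOR))
        + "</tr>"
        for row in rows
    )


# Illustrative row (not from the question) containing a character that needs escaping.
body = rows_to_html(["1````Jam````a < b"])
```

The result can be placed between `html_table_header` and `html_table_footer` exactly like `body` in the answer.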
    


    Including bs4 (seems a bit redundant):

    from bs4 import BeautifulSoup as bs
    
    soup = bs(html, 'lxml')
    print(html)
    print(rh(str(soup))[0])
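
If the rows really do need to be built with Beautiful Soup tags, as the question set out to do, one possible shape is to create a fresh `td` and `p` for each field and append them to their own parent, instead of navigating to an existing sibling. This is a sketch under that assumption, not the original poster's final code; `build_rows` and `SEPARATOR` are hypothetical names.

```python
from bs4 import BeautifulSoup

SEPARATOR = "````"  # cell delimiter used by the rows in the question


def build_rows(rows):
    """Build <tr><td><p>field</p></td>...</tr> markup, one <tr> per input row."""
    soup = BeautifulSoup("", "html.parser")  # used only as a tag factory
    parts = []
    for row in rows:
        tr = soup.new_tag("tr")
        for cell in row.split(SEPARATOR):
            td = soup.new_tag("td")
            p = soup.new_tag("p")
            p.string = cell   # text goes into this cell's own <p>
            td.append(p)      # <p> goes into this cell's own <td>
            tr.append(td)     # <td> is appended once the cell is complete
        parts.append(str(tr))  # serialize the row once, after all cells are added
    return "".join(parts)


row_string = build_rows(["1````Jam"])
```

Because the text is assigned through `.string`, Beautiful Soup escapes characters such as `&` and `<` when the tag is serialized.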
    

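The question assembles its JSON request body by string concatenation and escapes quotes by hand. A possibly less fragile route is to build a dictionary and let `json.dumps` do the escaping. The sketch below reuses the field names and values from the question's `data` string; `build_payload` is a hypothetical helper, and no request is actually sent.

```python
import json


def build_payload(table_html, page_id, title, space_key, version):
    """Serialize the page update; json.dumps escapes quotes and backslashes."""
    return json.dumps({
        "id": page_id,
        "type": "page",
        "title": title,
        "space": {"key": space_key},
        "body": {"storage": {"value": table_html, "representation": "storage"}},
        "version": {"number": version},
    })


# Values taken from the question; the HTML here is just the table footer.
payload = build_payload(
    '</tbody></table><p class="auto-cursor-target"><br /></p>',
    "534756378",
    "GL_Engine Output",
    "CSSAI",
    2,
)
```

The resulting string could then be passed as `data=payload` to `requests.put` together with the `Content-Type: application/json` header already used in the question.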
    【Discussion】:
