【Question title】: Is there any way to programmatically edit nested tables in an HTML file using BeautifulSoup?
【Posted】: 2020-05-02 03:11:16
【Question description】:

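I am using BeautifulSoup to scrape tables from a web page, and I have managed to write the text to a txt file.

However, some cells contain several tables of their own. I assume the developers had styling requirements that they could not meet by formatting the cell in any other way. I ran into many problems scraping the tables as they are, so I would like to know whether there is a way to edit the HTML programmatically so that the text in these nested tables is pulled up into the original cell.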

Here is an example of what I mean.

From a nested table like this

<tr class="table">
             <td class="table" valign="top">
                <p class="tbl-cod">0403</p>
             </td>
             <td class="table" valign="top">
                <p class="tbl-txt">Buttermilk, curdled milk and&nbsp;cream, yoghurt, kephir and other fermented or acidified milk and&nbsp;cream, whether or not concentrated or&nbsp;containing added sugar or other sweetening matter or flavoured or&nbsp;containing added fruit, nuts or&nbsp;cocoa</p>
             </td>
             <td class="table" valign="top">
                <p class="tbl-txt">Manufacture in which:</p>
                <table width="100%" cellspacing="0" cellpadding="0" border="0">
                   <colgroup><col width="4%">
                   <col width="96%">
                   </colgroup><tbody>
                      <tr>
                         <td valign="top">
                            <p class="normal">—</p>
                         </td>
                         <td valign="top">
                            <p class="normal">all the materials of Chapter&nbsp;4 used are wholly obtained,</p>
                         </td>
                      </tr>
                   </tbody>
                </table>
                <table width="100%" cellspacing="0" cellpadding="0" border="0">
                   <colgroup><col width="4%">
                   <col width="96%">
                   </colgroup><tbody>
                      <tr>
                         <td valign="top">
                            <p class="normal">—</p>
                         </td>
                         <td valign="top">
                            <p class="normal">all the fruit juice (except that of pineapple, lime or&nbsp;grapefruit) of heading&nbsp;2009 used is originating,</p>
                            <p class="normal">and</p>
                         </td>
                      </tr>
                   </tbody>
                </table>
                <table width="100%" cellspacing="0" cellpadding="0" border="0">
                   <colgroup><col width="4%">
                   <col width="96%">
                   </colgroup><tbody>
                      <tr>
                         <td valign="top">
                            <p class="normal">—</p>
                         </td>
                         <td valign="top">
                            <p class="normal">the value of all the materials of Chapter&nbsp;17 used does not exceed 30&nbsp;% of the ex-works price of the product</p>
                         </td>
                      </tr>
                   </tbody>
                </table>
             </td>
             <td class="table" valign="top">
                <p class="normal">&nbsp;</p>
             </td>
          </tr>

I would like to edit the HTML file to get

<tr class="table">
             <td class="table" valign="top">
                <p class="tbl-cod">0403</p>
             </td>
             <td class="table" valign="top">
                <p class="tbl-txt">Buttermilk, curdled milk and&nbsp;cream, yoghurt, kephir and other fermented or acidified milk and&nbsp;cream, whether or not concentrated or&nbsp;containing added sugar or other sweetening matter or flavoured or&nbsp;containing added fruit, nuts or&nbsp;cocoa</p>
             </td>
             <td class="table" valign="top">
                <p class="tbl-txt">Manufacture in which: all the materials of Chapter&nbsp;4 used are wholly obtained, — all the fruit juice (except that of pineapple, lime or&nbsp;grapefruit) of heading&nbsp;2009 used is originating, — the value of all the materials of Chapter&nbsp;17 used does not exceed 30&nbsp;% of the ex-works price of the product</p>
             </td>
             <td class="table" valign="top">
                <p class="normal">&nbsp;</p>
             </td>
          </tr>

for all the nested tables inside cells.

【Comments on the question】:

    Tags: python html web-scraping html-table beautifulsoup


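    【Solution 1】:

    Yes, you can do this as long as your HTML always has this structure. Find all the columns (td) in each row, then check whether a column has a child table. If it does, collect the text of all the p tags in that column and replace the text of the first p tag with it. Then decompose() all the table tags in the column.

    Code: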

    from bs4 import BeautifulSoup

    html='''<tr class="table">
                 <td class="table" valign="top">
                    <p class="tbl-cod">0403</p>
                 </td>
                 <td class="table" valign="top">
                    <p class="tbl-txt">Buttermilk, curdled milk and&nbsp;cream, yoghurt, kephir and other fermented or acidified milk and&nbsp;cream, whether or not concentrated or&nbsp;containing added sugar or other sweetening matter or flavoured or&nbsp;containing added fruit, nuts or&nbsp;cocoa</p>
                 </td>
                 <td class="table" valign="top">
                    <p class="tbl-txt">Manufacture in which:</p>
                    <table width="100%" cellspacing="0" cellpadding="0" border="0">
                       <colgroup><col width="4%">
                       <col width="96%">
                       </colgroup><tbody>
                          <tr>
                             <td valign="top">
                                <p class="normal">—</p>
                             </td>
                             <td valign="top">
                                <p class="normal">all the materials of Chapter&nbsp;4 used are wholly obtained,</p>
                             </td>
                          </tr>
                       </tbody>
                    </table>
                    <table width="100%" cellspacing="0" cellpadding="0" border="0">
                       <colgroup><col width="4%">
                       <col width="96%">
                       </colgroup><tbody>
                          <tr>
                             <td valign="top">
                                <p class="normal">—</p>
                             </td>
                             <td valign="top">
                                <p class="normal">all the fruit juice (except that of pineapple, lime or&nbsp;grapefruit) of heading&nbsp;2009 used is originating,</p>
                                <p class="normal">and</p>
                             </td>
                          </tr>
                       </tbody>
                    </table>
                    <table width="100%" cellspacing="0" cellpadding="0" border="0">
                       <colgroup><col width="4%">
                       <col width="96%">
                       </colgroup><tbody>
                          <tr>
                             <td valign="top">
                                <p class="normal">—</p>
                             </td>
                             <td valign="top">
                                <p class="normal">the value of all the materials of Chapter&nbsp;17 used does not exceed 30&nbsp;% of the ex-works price of the product</p>
                             </td>
                          </tr>
                       </tbody>
                    </table>
                 </td>
                 <td class="table" valign="top">
                    <p class="normal">&nbsp;</p>
                 </td>
              </tr>'''
    
    # Parse the fragment; the 'lxml' parser wraps it in <html><body>.
    soup = BeautifulSoup(html, 'lxml')
    for row in soup.find_all('tr', class_='table'):
        for col in row.find_all('td'):
            if col.find_all('table'):
                # Collect the text of every <p> in this cell, including those inside the nested tables.
                ptag_text = ''.join(p.text for p in col.find_all('p'))
                # Replace the text of the first <p> with the combined text.
                col.find('p').next_element.replace_with(ptag_text)
                # Remove the nested tables now that their text has been merged.
                for item in col.find_all('table'):
                    item.decompose()

    print(soup)
    

    Output

    <html><body><tr class="table">
    <td class="table" valign="top">
    <p class="tbl-cod">0403</p>
    </td>
    <td class="table" valign="top">
    <p class="tbl-txt">Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa</p>
    </td>
    <td class="table" valign="top">
    <p class="tbl-txt">Manufacture in which:—all the materials of Chapter 4 used are wholly obtained,—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,and—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product</p>
    
    
    
    </td>
    <td class="table" valign="top">
    <p class="normal"> </p>
    </td>
    </tr></body></html>
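
The code above joins the pieces with an empty string, so the merged text has no spaces between items. The following is a minimal sketch of a variant that joins them with single spaces, closer to the output requested in the question. It is not the answerer's code. It assumes each nested table is a direct child of its td and that the cell's first direct-child p should receive the merged text. The name flatten_nested_tables and the sample markup are made up for illustration, and the sketch uses the built-in html.parser and extract() rather than lxml and decompose().

```python
from bs4 import BeautifulSoup


def flatten_nested_tables(html, parser="html.parser"):
    """Merge the text of tables nested directly inside a cell into that cell's first <p>."""
    soup = BeautifulSoup(html, parser)
    for cell in soup.find_all("td"):
        first_p = cell.find("p", recursive=False)
        nested = cell.find_all("table", recursive=False)
        if first_p is None or not nested:
            continue  # nothing to merge in this cell
        parts = [first_p.get_text(" ", strip=True)]
        for table in nested:
            # Join the table's text fragments with single spaces.
            parts.append(table.get_text(" ", strip=True))
            # Detach the nested table from the tree.
            table.extract()
        first_p.string = " ".join(parts)
    return soup


# Hypothetical, shortened markup in the same shape as the question's example.
sample = (
    "<table><tr><td>"
    "<p>Manufacture in which:</p>"
    "<table><tr><td><p>-</p></td><td><p>first condition,</p></td></tr></table>"
    "<table><tr><td><p>-</p></td><td><p>second condition</p></td></tr></table>"
    "</td></tr></table>"
)
soup = flatten_nested_tables(sample)
merged = soup.find("p").get_text()
print(merged)  # Manufacture in which: - first condition, - second condition
```

Passing a separator and strip=True to get_text() is what inserts the spaces. If a page nests tables more deeply than one level, this sketch would need adjusting.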
    

    If you do not want the newlines that remain in the output, strip them all with .replace, as shown below.

    finalhtml=str(soup).replace('\n','')
    print(finalhtml)
    

    Output

    <html><body><tr class="table"><td class="table" valign="top"><p class="tbl-cod">0403</p></td><td class="table" valign="top"><p class="tbl-txt">Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa</p></td><td class="table" valign="top"><p class="tbl-txt">Manufacture in which:—all the materials of Chapter 4 used are wholly obtained,—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,and—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product</p></td><td class="table" valign="top"><p class="normal"> </p></td></tr></body></html>
    

    If you then want the HTML formatted again, try this

    finalhtml=str(soup).replace('\n','')
    soup=BeautifulSoup(finalhtml,'lxml')
    print(soup.prettify(formatter=None))
    

    Output

    <html>
     <body>
      <tr class="table">
       <td class="table" valign="top">
        <p class="tbl-cod">
         0403
        </p>
       </td>
       <td class="table" valign="top">
        <p class="tbl-txt">
         Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa
        </p>
       </td>
       <td class="table" valign="top">
        <p class="tbl-txt">
         Manufacture in which:—all the materials of Chapter 4 used are wholly obtained,—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,and—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product
        </p>
       </td>
       <td class="table" valign="top">
        <p class="normal">
        </p>
       </td>
      </tr>
     </body>
    </html>
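
prettify() only adds indentation and line breaks, so whitespace-stripped cell text should be the same whether or not the HTML has been prettified. The sketch below illustrates this on a made-up fragment. The helper name cell_texts and the sample markup are assumptions for illustration, and the built-in html.parser is used in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical compact markup, standing in for the flattened HTML.
compact = (
    "<table><tr>"
    "<td><p>0403</p></td>"
    "<td><p>Buttermilk and cream</p></td>"
    "</tr></table>"
)


def cell_texts(markup):
    """Return the whitespace-stripped text of every <td> in the markup."""
    parsed = BeautifulSoup(markup, "html.parser")
    return [td.get_text(" ", strip=True) for td in parsed.find_all("td")]


# prettify() returns a re-indented string with one tag or text node per line.
pretty = BeautifulSoup(compact, "html.parser").prettify()

# The two strings differ, but the stripped cell text is identical.
same_text = cell_texts(compact) == cell_texts(pretty)
print(same_text)
```

Without strip=True, the text from the prettified version would carry the extra newlines and indentation that prettify() puts around each text node.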
    

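    【Comments】:

    • Thanks, I had been looking for this for weeks. Everything works. Just a naive question to make sure I understand how bs4 works: if I don't prettify the final HTML, it will still be parsed and scraped correctly, right? I mean, as the name suggests, .prettify() is purely a cosmetic function for readability?
    • @AleVesprini: Yes, you are right. If this solved your problem, please mark it as accepted by clicking the hollow check mark; see how to do that at stackoverflow.com/help/someone-answers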