【Posted】:2020-05-02 03:11:16
【Problem description】:
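I am using BeautifulSoup to scrape tables from a web page, and I have managed to write their text to a txt file.
However, some cells contain several nested tables. I assume the developers had a layout requirement that they could not meet by formatting the cell any other way. Scraping the tables as they are has caused me a lot of problems, so I would like to know whether there is a way to edit the HTML programmatically so that the text inside these nested tables is pulled up into the original cell.
Here is an example of what I mean.
Starting from a row with nested tables like this: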
<tr class="table">
<td class="table" valign="top">
<p class="tbl-cod">0403</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Manufacture in which:</p>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<colgroup><col width="4%">
<col width="96%">
</colgroup><tbody>
<tr>
<td valign="top">
<p class="normal">—</p>
</td>
<td valign="top">
<p class="normal">all the materials of Chapter 4 used are wholly obtained,</p>
</td>
</tr>
</tbody>
</table>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<colgroup><col width="4%">
<col width="96%">
</colgroup><tbody>
<tr>
<td valign="top">
<p class="normal">—</p>
</td>
<td valign="top">
<p class="normal">all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,</p>
<p class="normal">and</p>
</td>
</tr>
</tbody>
</table>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<colgroup><col width="4%">
<col width="96%">
</colgroup><tbody>
<tr>
<td valign="top">
<p class="normal">—</p>
</td>
<td valign="top">
<p class="normal">the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product</p>
</td>
</tr>
</tbody>
</table>
</td>
<td class="table" valign="top">
<p class="normal"> </p>
</td>
</tr>
I would like to edit the HTML file so that the row becomes:
<tr class="table">
<td class="table" valign="top">
<p class="tbl-cod">0403</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Manufacture in which: all the materials of Chapter 4 used are wholly obtained, — all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating, — the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product</p>
</td>
<td class="table" valign="top">
<p class="normal"> </p>
</td>
</tr>
In other words, the text of every nested table in a cell should be merged into that cell.
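The transformation described above could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the function name `flatten_nested_tables` is hypothetical, the built-in `html.parser` backend is used, and the merged paragraph reuses the attributes of the first `<p>` in the cell (for example `class="tbl-txt"`). It keeps all of the text in the cell, including every dash and the "and", so the result is slightly longer than the hand-written target row.

```python
from bs4 import BeautifulSoup


def flatten_nested_tables(html):
    """Hypothetical helper: merge the text of tables nested in a <td> into one <p>."""
    soup = BeautifulSoup(html, "html.parser")
    for cell in soup.find_all("td"):
        # Skip cells that do not contain a nested table.
        if cell.find("table") is None:
            continue
        # Collect every text fragment in the cell and collapse runs of whitespace.
        text = " ".join(cell.get_text(" ", strip=True).split())
        # Assumption: reuse the attributes of the first paragraph in the cell.
        first_p = cell.find("p")
        attrs = dict(first_p.attrs) if first_p is not None else {}
        # Remove the old children (paragraphs and nested tables) from the cell.
        cell.clear()
        new_p = soup.new_tag("p")
        new_p.attrs.update(attrs)
        new_p.string = text
        cell.append(new_p)
    return str(soup)


if __name__ == "__main__":
    # Shortened, made-up sample row in the same shape as the example above.
    sample = (
        '<tr class="table"><td class="table"><p class="tbl-txt">Manufacture in which:</p>'
        '<table><tbody><tr><td><p class="normal">-</p></td>'
        '<td><p class="normal">all the materials used are wholly obtained</p></td>'
        '</tr></tbody></table></td></tr>'
    )
    print(flatten_nested_tables(sample))
```

Cells without a nested table are left untouched, so the function can be run over a whole page before the existing text extraction step. If the dashes or the trailing "and" are unwanted, they would have to be filtered out of the collected text separately.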
【Discussion】:
Tags: python html web-scraping html-table beautifulsoup