How to parse an HTML file in .TXT format (un-tabbed) in Python?

【Question title】: How to parse HTML file in .TXT format (un-tabbed) in Python?
【Posted】: 2019-04-05 00:45:17
【Question】:

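I've run into a programming problem that has me stumped.

I'm trying to access data stored in a large number of old HTML files that were saved as text files. When the HTML was saved, however, it lost its indentation, tabs, hierarchy, whatever you want to call it. An example is shown below.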

......

<tr class="ro">
<td class="pl " style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_RevenueFromContractWithCustomerExcludingAssessedTax', window );">Net sales</a></td>
<td class="nump">$ 123,897<span></span>
</td>
<td class="nump">$ 122,136<span></span>
</td>
<td class="nump">$ 372,586<span></span>
</td>
<td class="nump">$ 360,611<span></span>
</td>
</tr>
<tr class="re">
<td class="pl " style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_OtherIncome', window );">Membership and other income</a></td>
<td class="nump">997<span></span>
</td>
<td class="nump">1,043<span></span>
</td>
<td class="nump">3,026<span></span>
</td>
<td class="nump">3,465<span></span>
</td>
</tr>
<tr class="rou">
<td class="pl " style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_Revenues', window );">Total revenues</a></td>
<td class="nump">124,894<span></span>
</td>
<td class="nump">123,179<span></span>
</td>
<td class="nump">375,612<span></span>
</td>
<td class="nump">364,076<span></span>
</td>
</tr>

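I would normally use Beautiful Soup here and start parsing the data that way, but I haven't found a good workflow because there is technically no hierarchy here; I can't tell BS to look inside anything other than the document itself, which is massive and possibly too time-consuming (see the next statement).

I also need a thorough solution rather than a quick fix, because I have hundreds, if not thousands, of these HTML-as-text files to parse.

So my question is: if I wanted to return the first number for "Membership and other income" (997 in this example) from every file, how would I go about doing that?

Two example files can be found here: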

(https://www.sec.gov/Archives/edgar/data/1800/0001104659-18-065076.txt) (https://www.sec.gov/Archives/edgar/data/1084869/0001437749-18-020205.txt)

Edit - 4/16

Thanks for the replies, everyone! I've written some code that returns the tags I'm looking for.

import requests
from bs4 import BeautifulSoup

data = requests.get('https://www.sec.gov/Archives/edgar/data/320193/0000320193-18-000070.txt')

# load the data
soup = BeautifulSoup(data.text, 'html.parser')

# get the data
for tr in soup.find_all('tr', {'class':['rou','ro','re','reu']}):
    db = [td.text.strip() for td in tr.find_all('td')]
    print(db)

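The problem is that this returns a huge number of rows, and most of them are of no use. Is there a way to filter based on the grandparents of these tags? I've tried the same approach as above with head, title, body, etc., but I can't quite get BS to recognize the FILENAME.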

<DOCUMENT>
<TYPE>XML
<SEQUENCE>14
**<FILENAME>R2.htm**
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
<html>
<head>
<title></title>
.....removed for brevity
</head>
<body>
.....removed for brevity
<td class="text">&#160;<span></span>
</td>
.....removed for brevity
</tr>
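One way to narrow the output is to split the .txt file into its embedded documents before handing anything to Beautiful Soup, then parse only the one whose FILENAME is wanted. The following is a minimal sketch, not a definitive implementation. It assumes that each embedded file is closed by `</TEXT>` and `</DOCUMENT>` (the excerpt above only shows the opening tags), and `split_documents` and `sample` are illustrative names that do not come from the question.

```python
import re

# Assumption: every embedded file sits between <DOCUMENT> and </DOCUMENT>,
# and its payload sits between <TEXT> and </TEXT>.
DOCUMENT_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL | re.IGNORECASE)
FILENAME_RE = re.compile(r"<FILENAME>([^\r\n<]+)", re.IGNORECASE)
TEXT_RE = re.compile(r"<TEXT>(.*?)(?:</TEXT>|\Z)", re.DOTALL | re.IGNORECASE)


def split_documents(submission_text):
    """Map each <FILENAME> value to the raw content of its <TEXT> section."""
    documents = {}
    for match in DOCUMENT_RE.finditer(submission_text):
        block = match.group(1)
        filename = FILENAME_RE.search(block)
        body = TEXT_RE.search(block)
        if filename and body:
            documents[filename.group(1).strip()] = body.group(1).strip()
    return documents


# Small stand-in for a real submission file.
sample = """<DOCUMENT>
<TYPE>XML
<SEQUENCE>14
<FILENAME>R2.htm
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
<html><body><table><tr class="re"><td class="nump">997<span></span>
</td></tr></table></body></html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>XML
<SEQUENCE>15
<FILENAME>R3.htm
<TEXT>
<html><body><p>other report</p></body></html>
</TEXT>
</DOCUMENT>"""

docs = split_documents(sample)
print(sorted(docs))  # ['R2.htm', 'R3.htm']
```

The string stored under `docs["R2.htm"]` is ordinary HTML, so it can be passed to `BeautifulSoup(..., 'html.parser')` on its own, which keeps rows from unrelated reports out of the results.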

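【Comments】:

Why do you think there is no hierarchy? Indentation doesn't matter; the HTML tags do. A <table> contains <tr> elements, which contain <td> elements. You can iterate over the <tr> rows.
HTML is text; it just happens to contain markup tags.
If you're interested, this can be done with lxml.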
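As a rough illustration of the lxml suggestion, the sketch below selects a row by its label with XPath and reads the first `nump` cell. It assumes the `lxml` package is installed and that the label text sits in an `<a>` inside the first `<td>`, as in the excerpt from the question. `page` and `first_value` are made-up names for this example, and the markup is a shortened copy of the question's rows.

```python
from lxml import html

# Shortened copy of the rows in the question, with no indentation at all.
page = (
    '<table>'
    '<tr class="ro"><td class="pl "><a class="a" href="javascript:void(0);">Net sales</a></td>'
    '<td class="nump">$ 123,897<span></span>\n</td>'
    '<td class="nump">$ 122,136<span></span>\n</td></tr>'
    '<tr class="re"><td class="pl "><a class="a" href="javascript:void(0);">'
    'Membership and other income</a></td>'
    '<td class="nump">997<span></span>\n</td>'
    '<td class="nump">1,043<span></span>\n</td></tr>'
    '</table>'
)

tree = html.fromstring(page)


def first_value(tree, label):
    """Return the text of the first 'nump' cell in the row labelled `label`."""
    # $label is an XPath variable; lxml fills it from the keyword argument.
    cells = tree.xpath(
        '//tr[td/a[normalize-space()=$label]]/td[@class="nump"]',
        label=label,
    )
    return cells[0].text_content().strip() if cells else None


print(first_value(tree, "Membership and other income"))  # 997
```

The XPath expression relies only on the tag structure (`tr` containing `td` containing `a`), which is why the missing indentation makes no difference.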

Tags: python html text beautifulsoup

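【Solution 1】:

Note that HTML doesn't care about indentation. If you really wanted to, it could all be on a single line with no whitespace in between. An HTML parser only looks at the structure of the tags.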

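from bs4 import BeautifulSoup

# html_doc is the text of the file, e.g. the result of open(path).read()
soup = BeautifulSoup(html_doc, 'html.parser')
# find_all is a method, so call it with parentheses; pass the bare tag name, e.g. 'tr'
first_match = soup.find_all('tag_you_are_looking_for')[0]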
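To make this concrete for the value asked about in the question, here is a minimal sketch that walks the `<tr>` rows and returns the first `nump` cell of the row whose label matches. The helper name `first_number_for` and the variable `flat_html` are illustrative only, and the sketch assumes the label is the text of the first `<td>` in the row, as in the question's excerpt. The sample markup is deliberately written without any line breaks or indentation between rows.

```python
from bs4 import BeautifulSoup

# The question's three rows (shortened), with no indentation between tags.
flat_html = (
    '<table><tr class="ro"><td class="pl "><a class="a">Net sales</a></td>'
    '<td class="nump">$ 123,897<span></span>\n</td></tr>'
    '<tr class="re"><td class="pl "><a class="a">Membership and other income</a></td>'
    '<td class="nump">997<span></span>\n</td><td class="nump">1,043<span></span>\n</td></tr>'
    '<tr class="rou"><td class="pl "><a class="a">Total revenues</a></td>'
    '<td class="nump">124,894<span></span>\n</td></tr></table>'
)


def first_number_for(html_text, label):
    """Return the first 'nump' cell of the row whose first cell equals `label`."""
    soup = BeautifulSoup(html_text, "html.parser")
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        # Skip rows without cells or with a different label.
        if not cells or cells[0].get_text(strip=True) != label:
            continue
        numbers = row.find_all("td", class_="nump")
        if numbers:
            return numbers[0].get_text(strip=True)
    return None


print(first_number_for(flat_html, "Membership and other income"))  # 997
```

The function returns the cell text as a string; converting it to a number (removing `$` and thousands separators) is left out, since the files may format values differently. For many files, the same function can be called in a loop over the file contents.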

【Comments】: