Getting a clean text file from the BeautifulSoup HTML parser: answers

【Question title】: Getting clean text file from BeautifulSoup html parser
【Posted】: 2018-02-13 13:29:48
【Question】:

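While trying to run text analysis on Project Gutenberg files, I have run into a lot of problems with BeautifulSoup (see yesterday's solved problem). I have almost all of my code sorted out, but one last issue has me stumped: how do I get a clean text file after removing some redundant text from the BeautifulSoup-cleaned version? Let me explain: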

Step 1: I extract the text minus the HTML junk, recording the title as I go:

from bs4 import BeautifulSoup
import re

### Opens saved html file
html = open("/filepath/Jane_Eyre_Test.htm")

### Cleans html file
soup = BeautifulSoup(html, 'html.parser')


title = re.findall(r'<title>(.*?)</title>',soup.get_text())

Step 2: Remove the boilerplate Gutenberg license text so that it does not skew the analysis:

s1 = '***START OF THE PROJECT GUTENBERG EBOOK '+title[0].upper()+'***'

s2 = '***END OF THE PROJECT GUTENBERG EBOOK '+title[0].upper()+'***'

main_text = soup.get_text()[(soup.get_text().index(s1)+len(s1)):soup.get_text().index(s2)]

Step 3: Open a text file and write the result to it:

#### Opens blank text file
f = open('filepath/'+title[0]+'.txt', 'w')
f.write(main_text)

Now, here is the problem: when I do this, the resulting text file is full of formatting tags, for example:

Transcribed from the 1897 Service & Paton edition by David Price, email ccx074@pglaf.org

But when I try to clean it with Beautiful Soup as follows,

main_text1 = BeautifulSoup(main_text, 'html.parser')
f.write(main_text1.get_text())

the result is not much better:

</pre> <p><a name="startoftext"></a></p> <p>Transcribed from the 1897
Service &amp; Paton edition by David Price, email ccx074@pglaf.org</p>

whereas

f.write(soup.get_text())

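produces a perfectly formatted text file. I suspect I am missing some key difference between text formatting and HTML formatting here; if so, any pointers are appreciated. Of course, any solution that gets rid of the formatting tags would be appreciated even more.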
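
A minimal sketch of Step 2, assuming the full text has already been obtained once (for example from a single `soup.get_text()` call) and stored in a string. The helper name `extract_between` and the sample string `full_text` are hypothetical and only illustrate the slicing; unlike chained `.index()` calls, a missing marker is reported with an explicit message.

```python
def extract_between(text, start_marker, end_marker):
    """Return the part of text that lies strictly between the two markers."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    start += len(start_marker)
    # Search for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return text[start:end]


# Hypothetical stand-in for the string returned by soup.get_text()
full_text = (
    "License header\n"
    "***START OF THE PROJECT GUTENBERG EBOOK JANE EYRE***\n"
    "Transcribed from the 1897 Service & Paton edition\n"
    "***END OF THE PROJECT GUTENBERG EBOOK JANE EYRE***\n"
    "License footer\n"
)

title = "Jane Eyre"  # assumed to have been recorded in Step 1
s1 = "***START OF THE PROJECT GUTENBERG EBOOK " + title.upper() + "***"
s2 = "***END OF THE PROJECT GUTENBERG EBOOK " + title.upper() + "***"

main_text = extract_between(full_text, s1, s2).strip()
```

Working on one stored string keeps the start and end offsets consistent and avoids re-extracting the text from the soup three times.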

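【Comments】:

Have you tried changing the parser to lxml or html5lib?
Just tried it, but got: "FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib.parser. Do you need to install a parser library?" pip install for html5lib and lxml says they are already installed.
Yes, you need to install html5lib via pip install html5lib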
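
One possible reading of the error quoted above: the message names `html5lib.parser`, which suggests that string was passed as the parser. Beautiful Soup's feature names for the two third-party parsers are `"lxml"` and `"html5lib"`; only the built-in one is spelled `"html.parser"`. The sketch below tries the parsers in order and falls back to the built-in one; the helper name `make_soup` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser library is not installed; try the next one
            continue
    raise RuntimeError("none of the requested parsers is available")


# Hypothetical sample markup containing an HTML entity
soup, parser_used = make_soup("<p>Service &amp; Paton</p>")
text = soup.get_text()
```

Because `html.parser` ships with Python, the fallback at the end of the tuple should always succeed when the default list is used.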

Tags: python html python-3.x beautifulsoup

【Solution 1】:

You can get a clean text file after removing the redundant text. Follow this approach:

>>> with open("Book_titles.txt", "w") as file:
...     for line in x1:  # x1: an iterable of text lines (not defined in this answer)
...         file.write(line)
...         file.write('\n')
...
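
A self-contained sketch of the same idea, since `x1` is not defined in the answer. The list `lines` is a hypothetical placeholder for whatever text was extracted, and a temporary directory is used only so the example does not touch real files.

```python
import os
import tempfile

# Hypothetical placeholder for the extracted lines (x1 in the answer above)
lines = ["Jane Eyre", "Wuthering Heights"]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "Book_titles.txt")

    # Write one line per entry; an explicit encoding avoids platform defaults
    with open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write("\n".join(lines) + "\n")

    # Read the file back to confirm what was written
    with open(out_path, encoding="utf-8") as in_file:
        written = in_file.read()
```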

【Comments】:

【Solution 2】:

Try the following approach; get_text() should work fine on the soup object:

from bs4 import BeautifulSoup
import re

# Parse the saved HTML file
with open('Jane_Eyre_Test.htm') as f_jane_html:
    soup = BeautifulSoup(f_jane_html, "html.parser")

# Locate the <a name="startoftext"> anchor and take the text of its grandparent
a = soup.find('a', attrs={"name" : "startoftext"})
text = a.parent.parent.get_text()

# Keep only what lies between the Gutenberg start and end markers
start = re.escape("***START OF THE PROJECT GUTENBERG EBOOK JANE EYRE***")
end = re.escape("***END OF THE PROJECT GUTENBERG EBOOK")
text = re.search('{}(.*){}'.format(start, end), text, re.S).group(1)

# Write the extracted text to a plain text file
with open('Jane_Eyre.txt', 'w') as f_jane_text:
    f_jane_text.write(text)

This gives you a file that starts and ends as follows:

Transcribed from the 1897 Service & Paton edition by David

Price, email ccx074@pglaf.org
JANE EYRE

AN AUTOBIOGRAPHY
by
.
.
.
I come quickly!’ and hourly I more eagerly

respond,—‘Amen; even so come, Lord

Jesus!’”
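
The answer above hard-codes the title JANE EYRE in the start marker. For a batch of books, a title-independent pattern may be more convenient. The sketch below is an assumption about the marker format (it accepts "THE" or "THIS" and any title text up to the closing asterisks) and has only been checked against the made-up sample string shown; the names `MARKED_BODY` and `gutenberg_body` are hypothetical.

```python
import re

# Matches the text between a START marker line and the following END marker.
# The marker wording is an assumption based on the strings quoted in this thread.
MARKED_BODY = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^*]*\*\*\*"
    r"(.*?)"
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.S,
)


def gutenberg_body(text):
    """Return the text between the markers, or None if they are not found."""
    match = MARKED_BODY.search(text)
    return match.group(1).strip() if match else None


# Hypothetical sample standing in for the output of get_text()
sample = (
    "header\n"
    "***START OF THE PROJECT GUTENBERG EBOOK JANE EYRE***\n"
    "JANE EYRE\nAN AUTOBIOGRAPHY\n"
    "***END OF THE PROJECT GUTENBERG EBOOK JANE EYRE***\n"
    "footer"
)
body = gutenberg_body(sample)
```

Returning None when the markers are absent avoids the AttributeError that `re.search(...).group(1)` raises when there is no match.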

The HTML used for testing was taken from Jane Eyre, by Charlotte Bronte

The test file was created as follows:

import requests

r = requests.get("http://www.gutenberg.org/files/1260/1260-h/1260-h.htm")

# r.content is bytes, so the file is opened in binary mode (required on Python 3)
with open('Jane_Eyre_Test.htm', 'wb') as f_jane_eyre:
    f_jane_eyre.write(r.content)

【Comments】:

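This is indeed an excellent (and fast!) solution. It is also exactly the HTML source I am working with. However, when I run it, I get an AttributeError after text = a.parent.parent.get_text(), namely: AttributeError: 'NoneType' object has no attribute 'parent'. Could this be because I downloaded the HTML file beforehand rather than accessing it via the URL?
I used requests.get() to download the HTML from that link and wrote it straight to a file. This means your HTML is structured slightly differently. You could try a.parent.get_text()
No luck with a.parent.get_text(), I'm afraid. If it were only the occasional file, I would use requests.get() too; however, I am doing a large amount of analysis on Gutenberg, and the wget interface they prefer only allows downloading straight to disk.
I have updated the answer to show how to use requests.get() to create the file.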
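
The AttributeError quoted above means that `soup.find(...)` returned None, i.e. no `<a name="startoftext">` element was found in that copy of the file. A defensive sketch, assuming it is acceptable to fall back to the text of the whole document when the anchor is missing; the helper name `text_from_anchor` and both sample strings are hypothetical.

```python
from bs4 import BeautifulSoup


def text_from_anchor(soup, anchor_name="startoftext"):
    """Return the text around the named anchor, or the whole text if it is absent."""
    anchor = soup.find("a", attrs={"name": anchor_name})
    if anchor is None or anchor.parent is None:
        # No anchor in this file: fall back to the full document text
        return soup.get_text()
    container = anchor.parent
    # Climb one more level when possible, mirroring a.parent.parent in the answer
    if container.parent is not None:
        container = container.parent
    return container.get_text()


# Hypothetical samples: one with the anchor, one without
with_anchor = BeautifulSoup(
    '<div><p><a name="startoftext"></a></p><p>Body text</p></div>', "html.parser"
)
without_anchor = BeautifulSoup("<p>No anchor here</p>", "html.parser")

found_text = text_from_anchor(with_anchor)
fallback_text = text_from_anchor(without_anchor)
```

The marker-based extraction from the answer can then be applied to whichever string comes back.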