How do I clean up Beautiful Soup's output

【Question title】: How do I clean up Beautiful Soup's output
【Posted】: 2021-04-02 00:05:50
【Question description】:

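I am trying to scrape a book from a website, and while parsing it with Beautiful Soup I noticed some errors in the output. Take this sentence, for example:

“You have more&hellip; direct control over your skaa. How many woul“Oh, a half dozen or so,”

Both "more&hellip;" and "woul" are errors that are introduced somewhere in the script.

Is there a way to clean up errors like these automatically? My sample code is below.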


import requests
from bs4 import BeautifulSoup

url = 'http://thefreeonlinenovel.com/con/mistborn-the-final-empire_page-1'
res = requests.get(url)
text = res.text
soup = BeautifulSoup(text, 'html.parser')
print(soup.prettify())

# get_text() already returns a str, so no str() conversion is needed
trin = soup.tr.get_text()
print(trin)

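【Comments】:

Does this answer your question? Using BeautifulSoup to get_text of td tags within a resultset
I couldn't find any other way to solve this, so I just wrote another script that mainly uses pandas, and it works well. Thanks for the info! I'm leaving this question up in case someone else can help me figure it out with Beautiful Soup, because I'd really like to use it.

Tags: python python-3.x beautifulsoup html-entities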

【Solution 1】:

You need to unescape the HTML entities. To apply that to your case while keeping the text, you can use stripped_strings:

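import html

import requests
from bs4 import BeautifulSoup

url = 'http://thefreeonlinenovel.com/con/mistborn-the-final-empire_page-1'
res = requests.get(url)
text = res.text
soup = BeautifulSoup(text, 'lxml')  # this parser requires the lxml package

# Walk the text fragments of the first table row, stripped of surrounding whitespace
for r in soup.select_one('table tr').stripped_strings:
    # Turn leftover entities such as &hellip; into the characters they stand for
    s = html.unescape(r)
    print(s)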
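
To try the same idea without hitting the website, here is a minimal offline sketch. The `SAMPLE_HTML` markup and the `clean_row_strings` helper are hypothetical, made up for illustration. The sketch assumes the page double-escapes its entities (`&amp;hellip;`), which would explain why a literal `&hellip;` survives parsing. It uses the built-in `html.parser` so that lxml is not required.

```python
import html

from bs4 import BeautifulSoup

# Hypothetical markup: the entity is double-escaped (&amp;hellip;), so the
# parser decodes it only once and a literal "&hellip;" is left in the text.
SAMPLE_HTML = """
<table>
  <tr>
    <td>You have more&amp;hellip; direct control over your skaa.</td>
    <td>   Oh, a half dozen or so,   </td>
  </tr>
</table>
"""


def clean_row_strings(markup):
    """Return the unescaped, whitespace-stripped strings of the first table row."""
    soup = BeautifulSoup(markup, "html.parser")
    row = soup.select_one("table tr")
    if row is None:
        # No table row in the markup, so there is nothing to clean
        return []
    # stripped_strings skips whitespace-only strings and strips the rest;
    # html.unescape then decodes any entity that is still spelled out.
    return [html.unescape(s) for s in row.stripped_strings]


cleaned = clean_row_strings(SAMPLE_HTML)
for line in cleaned:
    print(line)
```

If the real page does not double-escape its entities, `html.unescape` simply leaves the already-decoded text unchanged, so the extra call should be harmless.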

【Comments】: