Beautiful Soup using html.parser fails to decode quotation marks (answers)

【Question title】: Beautiful Soup using html.parser having troubles decoding quotation marks
【Posted】: 2025-12-13 10:20:04
【Question description】:

I have a simple program that fetches the text of an article from Fox News, but for some reason the quotation marks are not decoded correctly.

from bs4 import BeautifulSoup
import urllib

r = urllib.urlopen('http://www.foxnews.com/politics/2016/10/14/emails-reveal-clinton-teams-early-plan-for-handling-bill-sex-scandals.html').read()
soup = BeautifulSoup(r, 'html.parser')

for item in soup.find_all('div', class_='article-text'):
    print item.get_text().encode('UTF-8')

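This grabs the text I'm looking for, but almost every quotation mark in the article prints incorrectly, like this: Bill Clinton's. I have tried explicitly specifying utf-8 for the decoding, and I checked the page to see which encoding it declares, which is also utf-8, so I'm not sure why this happens.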
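One plausible explanation, which the thread itself never confirms, is that the UTF-8 bytes are fine and the console showing them assumes a legacy Windows code page. The Python 3 sketch below only illustrates that effect. The helper name `misread`, the sample string, and the two code pages (cp1252 and cp437) are assumptions chosen for illustration, not details taken from the question.

```python
def misread(text, console_encoding):
    """Encode text as UTF-8, then decode those bytes the way a console
    that assumes console_encoding would interpret them."""
    return text.encode("utf-8").decode(console_encoding, errors="replace")


# Assumed sample: the article likely uses U+2019 (right single quotation mark).
sample = "Bill Clinton\u2019s"

garbled_cp1252 = misread(sample, "cp1252")  # 'Bill Clintonâ€™s'
garbled_cp437 = misread(sample, "cp437")    # 'Bill ClintonΓÇÖs'
```

If the output on your machine looks like either of these, the problem is probably in how the console reads the bytes rather than in Beautiful Soup's parsing.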

【Comments】:

Where are you running this from?
I'm just running it on my laptop, using Eclipse Mars and PyDev.
I mean, which operating system are you using?
Sorry, I'm running Windows 10.

Tags: python beautifulsoup html-parsing

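【Solution 1】:

This doesn't explain why Beautiful Soup has trouble decoding the text, but I found two roundabout ways to work around the problem. One is to declare an encoding at the top of the script: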

      # This Python file uses the following encoding: utf-8

The other is to decode the text and then re-encode it as ASCII, which drops every non-ASCII character:

print(temp.decode('unicode_escape').encode('ascii','ignore'))
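For anyone on Python 3, here is a rough sketch of the same ASCII fallback, plus a gentler variant that swaps typographic quotes for straight ones before dropping anything. It is not the code from the answer above: the helper names `strip_non_ascii` and `asciify_quotes`, the mapping table, and the sample string are all hypothetical, and it works on `str` rather than on a Python 2 byte string.

```python
# Hypothetical mapping from common typographic quotes to ASCII quotes.
TYPOGRAPHIC_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def strip_non_ascii(text):
    """Drop every character that has no ASCII encoding."""
    return text.encode("ascii", "ignore").decode("ascii")


def asciify_quotes(text):
    """Replace curly quotes with straight ones, then drop what is left."""
    return strip_non_ascii(text.translate(TYPOGRAPHIC_TO_ASCII))


sample = "Bill Clinton\u2019s \u201cplan\u201d"
stripped = strip_non_ascii(sample)   # 'Bill Clintons plan'
asciified = asciify_quotes(sample)   # 'Bill Clinton\'s "plan"'
```

Plain stripping loses the apostrophe entirely, so the mapping step is worth having if the text needs to stay readable.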
