How to convert a string from Beautiful Soup to UTF-8 encoding (answers)

【Question title】: How to convert a string from Beautiful Soup to utf-8 encoding
【Posted】: 2015-12-15 22:13:42
【Question description】:

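I am running a parser in Python 2.7 that takes a text field of XML code from a database and uses Beautiful Soup to find and extract different tags in the XML. When I extract the name tag from XML like the following and get its text, it returns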

<author>
<name>Josef Šimánek</name>
</author>

Josef \xc5\xa0im\xc3\xa1nek

when it should look like

Josef Šimánek

My relevant code is as follows:

rss = str(f)
soup = BeautifulSoup(rss)
entries = soup.findAll('entry')
for entry in entries:
  author = entry.find('author')

  if author != None:
      for name in author.findAll("name"):
          if(checkNull(name).find(",") != -1):
              name = checkNull(name).split(",",1)
              for s in name:
                  print s
          else: 
              print name

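As you can see, the code pulls out the different tags and loops over them; if a name tag contains a comma-separated list of names, it splits the list and prints each name separately.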

def checkNull(item):
  if item != None:
    return item.text.rstrip()
  return " "

Also, checkNull is just a helper that checks whether the returned tag contains any text, as shown above.

I have tried the encode, decode, and unicode functions to fix the problem, but none of them worked. Is there another approach I could try?

【Question comments】:

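What exactly is not working?
>>> import json
>>> print 'Josef \xc5\xa0im\xc3\xa1nek'.decode('utf-8')
Josef Šimánek
>>> print json.dumps('Josef \xc5\xa0im\xc3\xa1nek')
"Josef \u0160im\u00e1nek"
@Chainik, he is probably on Windows and trying to print to the console. Windows does not support UTF-8 well, and Python 2.7 does not support code page 65001, which is the Windows UTF-8 code page. @Mazar, describing your environment and showing the error you get when using .decode('utf8') would help us help you.
Right, it does feel like an environment issue, but I didn't check Winblowz..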

Tags: python-2.7 unicode utf-8 beautifulsoup

【Solution 1】:

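name is a BeautifulSoup.Tag, not a string, so you are probably getting the object's __repr__, which is escaped so that it is safe for terminals that do not support UTF-8 (\xc5\xa0 is how Python writes the two bytes of the UTF-8 encoding of Š). name.text is probably the value you actually want, and it should be a Unicode string.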
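To make the relationship between the escaped bytes and the expected text concrete, here is a minimal sketch using only the byte sequence quoted in the question. It does not involve Beautiful Soup; the variable names raw and text are illustrative only.

```python
from __future__ import print_function

# The bytes shown in the question: the UTF-8 encoding of the author's name.
raw = b"Josef \xc5\xa0im\xc3\xa1nek"

# Decoding turns the byte string into a Unicode string.
text = raw.decode("utf-8")

# U+0160 and U+00E1 are the two non-ASCII letters in the name.
assert text == u"Josef \u0160im\u00e1nek"

# Encoding the Unicode string again gives back the original bytes.
assert text.encode("utf-8") == raw

# Two characters take two bytes each, so the lengths differ.
print(len(raw), len(text))  # prints: 15 13
```

If the value you hold is already a Unicode string, as name.text should be, no decode step is needed; the escapes only show up when bytes or a repr are displayed.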

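If you are on Windows, it is best to avoid printing to the console, because the Windows console does not easily support UTF-8. You can use https://pypi.python.org/pypi/win_unicode_console, but writing the output to a file is easier.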
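If you still want to print for debugging, one possible workaround is to escape whatever the console encoding cannot represent instead of letting print raise an error. The following is only a sketch: console_safe is a hypothetical helper, not part of Beautiful Soup or win_unicode_console, and it assumes sys.stdout.encoding reflects what the console can actually display.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks backslash-escaped.

    Hypothetical helper for illustration; 'encoding' defaults to the
    encoding reported by sys.stdout, falling back to ASCII.
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # "backslashreplace" swaps unencodable characters for escape sequences.
    return text.encode(encoding, "backslashreplace").decode(encoding)


name = u"Josef \u0160im\u00e1nek"

# With an ASCII-only console, both accented letters are escaped.
escaped = console_safe(name, "ascii")

# With a UTF-8 capable console, the text passes through unchanged.
unchanged = console_safe(name, "utf-8")
```

This keeps the output readable enough for debugging but loses the real characters, which is why writing to a UTF-8 file remains the simpler option.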

I have cleaned up your code a little to make it simpler (quicker null checks) and to write your output to a UTF-8 encoded file:

# io provides better access to files with working universal newline support
import io
# open a file in text mode, encoding all output to utf-8
output_file = io.open("output.txt", "w", encoding="utf-8")

rss = str(f)
soup = BeautifulSoup(rss)
entries = soup.findAll('entry')
for entry in entries:
  author = entry.find('author')

  # If not null or not empty
  if author:
      for name in author.findAll("name"):
          # .text contains the actual Unicode string value
          if name.text:
              names = name.text.split(",", 1)
              # If string contained a comma, you'll have two elements in a list
              # else you'll just have the 1 length list
              for flname in names:
                  # remove any whitespace on either side
                  output_file.write(flname.strip() + "\n")

output_file.close()
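The file-writing part of the code above can be tried without a database or Beautiful Soup. This is a self-contained sketch under stated assumptions: the sample strings stand in for name.text (the second one is made up for illustration), split_names is a hypothetical helper that mirrors the split-and-strip loop, and the output goes to a temporary directory rather than the working directory.

```python
import io
import os
import tempfile


def split_names(text):
    """Split on the first comma and strip whitespace from each part."""
    return [part.strip() for part in text.split(",", 1)]


# Hypothetical sample values standing in for name.text.
samples = [u"Josef \u0160im\u00e1nek", u"Doe, Jane "]

path = os.path.join(tempfile.mkdtemp(), "output.txt")

# Text mode with an explicit encoding: write() takes Unicode strings
# and the file object encodes them to UTF-8.
with io.open(path, "w", encoding="utf-8") as output_file:
    for text in samples:
        for part in split_names(text):
            output_file.write(part + u"\n")

# Read the file back both as raw bytes and as decoded text.
with io.open(path, "rb") as raw_file:
    data = raw_file.read()
with io.open(path, "r", encoding="utf-8") as text_file:
    lines = text_file.read().splitlines()
```

Reading the file back in binary mode shows the same \xc5\xa0 and \xc3\xa1 bytes from the question, while reading it as UTF-8 text gives the names as Unicode strings, one per line. Using with blocks closes the files even if a write fails.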

【Comments】: