How to write non-English text to a file using BeautifulSoup: answers

【Question title】: How to write to file non-English language using BeautifulSoup
【Posted】: 2018-11-08 14:31:44
【Question description】:

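I am learning web scraping with BeautifulSoup and Python. My first project is to extract certain recipes from cookpad.hu. I was able to extract them successfully, but now I am having trouble actually writing them to a file (in the only way I know how) because of this error: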

Traceback (most recent call last):
  File "cookpad_scrape.py", line 24, in <module>
    f.writerow(about_clean)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

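My code is below. I am using Python 2.7.14 on Ubuntu. A pastebin of the web page is here, and the web page itself is this.

I assume it cannot write the Hungarian letters? I am sure I am overlooking a very simple solution.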

import requests
from bs4 import BeautifulSoup 
import csv 

'''
Tree of page:
    <div id="recipe main">
        <div id="editor" class="editor">
            <div id="about">
            <section id="ingredients">
            <section id="steps">
'''
#text only: soup.get_text()

page = requests.get('https://cookpad.com/hu/receptek/5040119-parazson-sult-padlizsankrem')
soup = BeautifulSoup(page.text, 'lxml')

f = csv.writer(open('recipes.csv', 'w')) #create and open file in f variable, using 'w' mode
f.writerow(['Recipe 1']) #write top row headings

about = soup.find(id='about')
about_ext = about.p.extract()
about_clean = about_ext.get_text()
f.writerow(about_clean)

ingredients = soup.find(id='ingredients')
ingredients_ext = ingredients.ol.extract()
ingredients_clean = ingredients_ext.find_all(itemprop='ingredients')
#for ingredient in ingredients_clean:

steps = soup.find(id='steps')
steps_p = steps.find_all(itemprop='recipeInstructions')
for step in steps_p:
    extracted = step.p.extract()
    print(extracted.text)
    f.writerow([extracted])

Solution: run the script with Python 3 instead of Python 2, via python3 my_script.py
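A minimal sketch of why running under Python 3 avoids the error: there, a text file opened with an explicit encoding accepts strings containing non-ASCII characters directly. The file name recipes_demo.csv and the sample row are assumptions made for illustration; they are not taken from the scraped page.

```python
import csv
import os
import tempfile

# Hypothetical sample row; the recipe text is illustrative only.
row = ["Recipe 1", "Parázson sült padlizsánkrém"]

# Write into a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "recipes_demo.csv")

# encoding="utf-8" states the output encoding explicitly instead of relying
# on the locale; newline="" is what the csv documentation recommends.
with open(path, "w", encoding="utf-8", newline="") as fh:
    csv.writer(fh).writerow(row)

# Read the row back to confirm the accented characters survived.
with open(path, encoding="utf-8", newline="") as fh:
    read_back = next(csv.reader(fh))

print(read_back)
```

Passing the encoding explicitly keeps the result independent of the machine's locale settings.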

New problem: the exported scrape gives good results for my steps, but in the ingredients and about sections every letter is separated by commas.
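The comma-separated letters are most likely a consequence of how writerow treats its argument: it iterates over whatever it is given, so a bare string is split into one cell per character. The sketch below (Python 3, with a hypothetical stand-in value for about_clean) contrasts the two calls.

```python
import csv
import io

about_clean = "Padlizsánkrém"  # hypothetical stand-in for the scraped text

buf = io.StringIO()
writer = csv.writer(buf)

writer.writerow(about_clean)    # a str is iterated character by character
writer.writerow([about_clean])  # a one-element list yields a single cell

lines = buf.getvalue().splitlines()
print(lines[0])  # one character per cell
print(lines[1])  # the whole string in one cell
```

The same applies to any other string passed directly to writerow; wrapping each value in a list keeps it in a single cell.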

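【Question discussion】:

Is this Python 2 or 3? (If 3, which 3.x version, what platform are you on, and if Linux, what locale, or if Windows, what OEM code page?)
Please include the entire stack trace, not just the error message. It shows which line has the error. Also, what version of Python, 2 or 3?
Also, please give us the whole exception, with the traceback, rather than just the description string. I can guess that it is probably one of the writerow calls causing this, but the exception would tell us exactly which line.
Finally, if you could give us a complete (but minimal) HTML tree, rather than just a fragment that cannot be parsed, we could actually run and debug your code ourselves. Please read minimal reproducible example in the help for more guidance on what to include in a question.
As a side note: it looks like you are trying to write a CSV with only one column, whose values are just simple strings with no newlines or other control characters? If so, you do not really need CSV; you can just write the lines directly to a file. (If your data might contain newlines or other control characters, ignore this comment.)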
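The last comment's suggestion of writing lines directly could look roughly like the sketch below. It is written for Python 3; the steps list and the file name recipes_demo.txt are hypothetical stand-ins for the scraped data. Under Python 2.7, io.open offers the same encoding parameter.

```python
import os
import tempfile

# Hypothetical stand-ins for the scraped step texts.
steps = ["Süssük meg a padlizsánt.", "Keverjük össze a hozzávalókat."]

# Write into a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "recipes_demo.txt")

# One step per line, encoded as UTF-8 on the way out.
with open(path, "w", encoding="utf-8") as fh:
    for step in steps:
        fh.write(step + "\n")

# Read the lines back to confirm they round-trip unchanged.
with open(path, encoding="utf-8") as fh:
    written = fh.read().splitlines()

print(written)
```

This only works cleanly when no value contains a newline, which is the caveat the comment itself raises.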

Tags: python python-2.7 unicode web-scraping beautifulsoup

【Solution 1】:

You are running Python 2. On line 25 you are writing out the contents of the "about_clean" variable. You need to encode that value.

f.writerow(about_clean.encode("utf-8"))
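To illustrate what the encoding step changes, here is a small sketch. It is written for Python 3 so that it runs as-is, which is an assumption on my part, since the thread concerns Python 2. The ASCII codec cannot represent the character from the traceback, while UTF-8 can.

```python
text = "\xe1"  # U+00E1 (á), the character named in the traceback

# The ASCII codec only covers code points 0-127, so this attempt fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 represents the same character as two bytes.
utf8_bytes = text.encode("utf-8")

print(ascii_ok)
print(utf8_bytes)
```

In Python 2, writing a unicode value through the csv module triggers an implicit ASCII encode of this kind, which is why encoding explicitly to UTF-8 sidesteps the error.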

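【Discussion】:

This does not help, because he has multiple writerow calls, and because each one needs a list of strings, not a single string. (Because that is the whole point of writerow.)
Then he could use: f.writerow([v.encode("utf-8") for v in about_clean]) and perhaps create a utility function. I believe there is also a unicodecsv module.
I do not know of a unicodecsv module, although there is probably at least one. (I do know of a utf8csv module, which wraps up what is in the examples in the docs and then simplifies it to handle only UTF-8, because I wrote it... but honestly, I think it is better to read the examples in the docs, because you will never be able to extend or debug Unicode CSV code in 2.x without understanding what those examples are doing...)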
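The utility function mentioned in the comments might be sketched as follows. encode_row is a hypothetical name, not part of csv or bs4. The helper expects a list of cells, because iterating over a single string such as about_clean would again visit it character by character. It is meant for Python 2, where the csv writer expects byte strings; under Python 3 the cells should be passed as str without encoding.

```python
TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3


def encode_row(cells, encoding="utf-8"):
    """Return a new list in which every text cell is encoded to bytes.

    Hypothetical helper for illustration; non-text cells pass through as-is.
    """
    return [
        cell.encode(encoding) if isinstance(cell, TEXT_TYPE) else cell
        for cell in cells
    ]


# Illustrative values only; on Python 2 this would be used as
# f.writerow(encode_row([about_clean])).
row = encode_row([u"Recipe 1", u"padlizs\xe1nkr\xe9m", 3])
print(row)
```

Funnelling every writerow call through one helper keeps the encoding decision in a single place, which addresses the objection that the script has several writerow calls.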