【Question title】: Unicode issues when writing to CSV file
【Posted】: 2016-09-20 17:38:58
【Question】:

I need some guidance. I'm using the following code:

import requests
import bs4
import csv

results = requests.get('http://grad-schools.usnews.rankingsandreviews.com/best-graduate-schools/top-engineering-schools/eng-rankings?int=a74509')

reqSoup = bs4.BeautifulSoup(results.text, "html.parser")
i = 0
schools = []

for school in reqSoup:
    x = reqSoup.find_all("a", {"class" : "school-name"})
    while i < len(x):
        for name in x:
            y = x[i].get_text()
            i += 1
            schools.append(y)

with open('usnwr_schools.csv', 'wb') as f:
    writer = csv.writer(f)
        for y in schools:
        writer.writerow([y])

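My problem is that em-dashes show up in the resulting CSV file as utf-8. I've tried several different ways to fix it, but nothing seems to work (including attempting to use regex to get rid of them, and trying a .translate method that I found in a StackOverflow question from several years ago).

What am I missing? I'd like the csv results to contain only the text, minus the dashes.

I'm using Python 3.5 and am pretty new to Python.

【Comments】: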

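  • How do you expect the em-dashes to appear? Unicode is an abstract enumeration of characters; a file is a sequence of bytes. UTF-8 is the default way to encode Unicode characters as one or more bytes. If you want to remove the em-dashes or replace them with something else, you need to do that yourself; it's not the encoder's job.
  • All of your data is written as UTF-8 (evidently that is your locale's preferred encoding, since you didn't set encoding when opening the file). What do you want displayed instead? The rest of the text is still UTF-8 (even though that text could also be encoded as ASCII).
  • Note that the csv module simply writes data in a particular format. You pass the writer the data you want written. That means this isn't a csv module problem; it seems you want to pass different data, so perhaps your question should be how to restrict the data to ASCII characters only (presumably that's what you want: just a-z, A-Z, 0-9 and basic punctuation).
  • Yes: that's exactly what I want. I apologize if my question was confusing. I'd like the final CSV data to contain only text and basic punctuation, but haven't found any guidance on how to do that.
  • You have to either replace everything you don't want yourself (my answer), or use a whitelist of allowed code points and replace the other code points with the empty string.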

Tags: python python-3.x unicode


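【Answer 1】:

To remove the dashes, try y.replace("—","-").replace("–","-") (the first call turns an em dash into a hyphen-minus, the second an en dash into a hyphen-minus)

If you only want ASCII code points, you can remove everything else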

import string
whitelist=string.printable+string.whitespace
def clean(s):
    return "".join(c for c in s if c in whitelist)

(This only gives reasonable results for plain English text)
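
As a rough sketch of how the two suggestions above combine: if you only filter against the whitelist, the dash is dropped and the words run together, so it may be worth replacing the dashes first. The sample string and the helper name to_ascii are made up for illustration and are not part of the original code.

```python
import string

# string.printable already contains the ASCII whitespace characters
whitelist = set(string.printable)

def clean(s):
    """Drop every character that is not printable ASCII."""
    return "".join(c for c in s if c in whitelist)

def to_ascii(s):
    """Hypothetical helper: map em/en dashes to '-' first, then filter."""
    return clean(s.replace("\u2014", "-").replace("\u2013", "-"))

# Sample value: an em dash (U+2014) followed by a zero width space (U+200B)
name = "University of Michigan\u2014\u200bAnn Arbor"

print(clean(name))     # University of MichiganAnn Arbor
print(to_ascii(name))  # University of Michigan-Ann Arbor
```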

By the way, try

open('usnwr_schools.csv', 'w', newline='', encoding='utf-8') # or whatever encoding you like

because in Python 3, csv.writer takes a text file, not a binary file as it did in Python 2 (you opened the file in binary mode ("wb"))
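
A small sketch of the difference, assuming Python 3 and a throwaway temporary file (the rows and the path are made up for the demonstration): writing str rows through csv.writer to a file opened with "wb" should fail with a TypeError, while text mode with newline='' and an explicit encoding round-trips the em dash unchanged.

```python
import csv
import os
import tempfile

rows = [["University of Michigan\u2014Ann Arbor"], ["Cornell University"]]
path = os.path.join(tempfile.mkdtemp(), "demo.csv")  # throwaway location

# Binary mode: csv.writer hands str to a file object that only accepts bytes
try:
    with open(path, "wb") as f:
        csv.writer(f).writerows(rows)
    binary_mode_error = None
except TypeError as exc:
    binary_mode_error = exc

# Text mode: the file object encodes the str data itself
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

with open(path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))

print(type(binary_mode_error).__name__)  # TypeError
print(read_back == rows)                 # True
```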

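【Comments】:

  • Thank you very much: I like the whitelist approach (I tried using replace, but had no luck). However, I'm still getting the same result, like this: b'University of Michigan\xe2\x80\x94\xe2\x80\x8bAnn Arbor'
【Answer 2】:

Learn to embrace Unicode... the world isn't ASCII anymore.

Assuming you are on Windows and viewing the .CSV with Excel or Notepad, use the following line on Python 3. With only this change (and after fixing the indentation in your post), you can view even the non-ASCII characters correctly. Notepad and Excel like the UTF-8 BOM signature at the start of the file, which utf-8-sig provides.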

with open('usnwr_schools.csv', 'w', newline='', encoding='utf-8-sig') as f:

If you read the file in another Python script, make sure to open it as follows. The example you got, b'University of Michigan\xe2\x80\x94\xe2\x80\x8bAnn Arbor', comes from reading the file in binary mode, 'rb'.

with open('usnwr_schools.csv', encoding='utf-8-sig') as f:
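
To illustrate the point about 'rb', here is a minimal sketch that writes one sample name to a temporary file (the path and the sample string are assumptions for the demo) and reads it back both ways. Binary mode returns the raw encoded bytes, which print with \x escapes; text mode returns the decoded characters.

```python
import os
import tempfile

# Sample value: em dash (U+2014) followed by a zero width space (U+200B)
name = "University of Michigan\u2014\u200bAnn Arbor"
path = os.path.join(tempfile.mkdtemp(), "demo.csv")  # throwaway location

with open(path, "w", newline="", encoding="utf-8-sig") as f:
    f.write(name + "\n")

# Binary mode: a bytes object, no decoding applied
with open(path, "rb") as f:
    as_bytes = f.read()

# Text mode: the same content decoded back to str
with open(path, encoding="utf-8-sig") as f:
    as_text = f.read()

print(as_bytes)                 # b'\xef\xbb\xbfUniversity of Michigan\xe2\x80\x94...'
print(as_text == name + "\n")   # True
```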

If you are on Linux, you can use utf8 instead of utf-8-sig
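
The only difference between the two codecs is the three-byte signature at the start of the data. A short sketch using just the standard library (the sample string is arbitrary):

```python
import codecs

data = "Virginia Tech".encode("utf-8-sig")

# utf-8-sig prepends the UTF-8 BOM when encoding
print(data.startswith(codecs.BOM_UTF8))   # True

# Decoding with plain utf-8 keeps the BOM as a U+FEFF character
print(repr(data.decode("utf-8")))         # '\ufeffVirginia Tech'

# Decoding with utf-8-sig strips it
print(repr(data.decode("utf-8-sig")))     # 'Virginia Tech'

# utf-8-sig also reads data that has no BOM at all
no_bom = "Virginia Tech".encode("utf-8")
print(repr(no_bom.decode("utf-8-sig")))   # 'Virginia Tech'
```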

By the way, you can replace your loops with:

with open('usnwr_schools.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    for school in reqSoup:
        x = reqSoup.find_all("a", {"class" : "school-name"})
        for item in x:
            y = item.get_text()
            writer.writerow([y])

Reading it back:

with open('usnwr_schools.csv',encoding='utf-8-sig') as f:
    print(f.read())

Output:

Massachusetts Institute of Technology
Stanford University
University of California—​Berkeley
California Institute of Technology
Carnegie Mellon University
University of Michigan—​Ann Arbor
Georgia Institute of Technology
University of Illinois—​Urbana-​Champaign
Purdue University—​West Lafayette
University of Texas—​Austin (Cockrell)
Texas A&M; University—​College Station (Look)
Cornell University
University of Southern California (Viterbi)
Columbia University (Fu Foundation)
University of California—​Los Angeles (Samueli)
University of California—​San Diego (Jacobs)
Princeton University
Northwestern University (McCormick)
University of Pennsylvania
Johns Hopkins University (Whiting)
Virginia Tech
University of California—​Santa Barbara
Harvard University
University of Maryland—​College Park (Clark)
University of Washington
(these 25 lines repeat four more times in the output)

If you still want ASCII only, this will do it:

import requests
import bs4
import csv

results = requests.get('http://grad-schools.usnews.rankingsandreviews.com/best-graduate-schools/top-engineering-schools/eng-rankings?int=a74509')

replacements = {ord('\N{EN DASH}'):'-',
                ord('\N{EM DASH}'):'-',
                ord('\N{ZERO WIDTH SPACE}'):None}

reqSoup = bs4.BeautifulSoup(results.text, "html.parser")

with open('usnwr_schools.csv', 'w', newline='', encoding='ascii') as f:
    writer = csv.writer(f)
    for school in reqSoup:
        x = reqSoup.find_all("a", {"class" : "school-name"})
        for item in x:
            y = item.get_text()
            writer.writerow([y.translate(replacements)])

with open('usnwr_schools.csv',encoding='ascii') as f:
    print(f.read())
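
The translate step can be tried without the network request. The sketch below reuses the same replacements table on hand-written sample strings (the samples, including the accented one, are made up for the demo). Note that the table covers only three code points, so with encoding='ascii' any other non-ASCII character would still raise UnicodeEncodeError when written.

```python
replacements = {ord('\N{EN DASH}'): '-',
                ord('\N{EM DASH}'): '-',
                ord('\N{ZERO WIDTH SPACE}'): None}  # None deletes the character

samples = ["University of Illinois\u2014\u200bUrbana-\u200bChampaign",
           "Stanford University"]

cleaned = [s.translate(replacements) for s in samples]
for s in cleaned:
    s.encode('ascii')  # would raise UnicodeEncodeError if non-ASCII remained
    print(s)

# A character outside the table is left alone and still fails to encode
try:
    "Universit\u00e9 Laval".translate(replacements).encode('ascii')
    uncovered_error = None
except UnicodeEncodeError as exc:
    uncovered_error = exc
print(type(uncovered_error).__name__)  # UnicodeEncodeError
```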

【Comments】:
