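【Question title】: How to display a Unicode character in Python
【Posted】: 2016-12-28 06:36:43
【Question description】:

I have a text file that contains accented characters such as "č", "š", and "ž". When I read this file with a Python program and put its contents into a Python list, the accented characters are lost: Python replaces them with other characters. For example, "č" is replaced by "_". Does anyone know how to preserve accented characters in a Python program when reading them from a file? My code: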

import sqlite3 #to work with relational DB

conn = sqlite3.connect('contacts.sqlite') #connect to db 
cur = conn.cursor() #db connection handle

cur.execute("DROP TABLE IF EXISTS contacts")

cur.execute("CREATE TABLE contacts (id INTEGER, name TEXT, surname  TEXT, email TEXT)")

fname = "acos_ibm_notes_contacts - test.csv"
fh = open(fname) #file handle
print " "
print "Reading", fname
print " "

#--------------------------------------------------
#First build a Python list with new contacts data: name, surname and email address

lst = list() #temporary list to hold content of the file
new_contact_list = list() #this list will contain contacts data: name, surname and email address
count = 0 # to count number of contacts
id = 1 #will be used to add contacts id into the DB
for line in fh: #for every line in the file handle
    new_contact = list()
    name = ''
    surname = ''
    mail = ''
    #split line into tokens at each '"' character and put tokens into  the temporary list
    lst = line.split('"')
    if lst[1] == ',': continue #if there is no first name, move to next line
    elif lst[1] != ',': #if 1st element of list is not empty
        name = lst[1] #this is the name
        if name[-1] == ',': #If last character in name is ','
            name = name[:-1] #delete it
        new_contact.append({'Name':name}) #add first name to new list of contacts
        if lst[5] != ',': #if there is a last name in the contact data
            surname = lst[5] #assign 5th element of the list to surname
            if surname[0] == ',': #If first character in surname is ','
                surname = surname[1:] #delete it
            if surname[-1] == ',': #If last character in surname is ','
                surname = surname[:-1] #delete it
            if ',' in surname: #if surname and mail are merged in same list element
                sur_mail = surname.split(',') #split them at the ','
                surname = sur_mail[0]
                mail = sur_mail[1]
            new_contact.append({'Surname':surname}) #add last name to new list of contacts
            new_contact.append({'Mail':mail}) #add mail address to new list of contacts
        new_contact_list.append(new_contact)
    count = count + 1

fh.close()
#--------------------------------------------------
# Second: populate the DB with data from the new_contact_list

row = cur.fetchone()
id = 1
for i in range(count):
    entry = new_contact_list[i] #every row in the list has data about 1 contact - put it into variable
    name_dict = entry[0] #First element is a dictionary with name data
    surname_dict = entry[1] #Second element is a dictionary with surname data
    mail_dict = entry[2] #Third element is a dictionary with mail data
    name = name_dict['Name']
    surname = surname_dict['Surname']
    mail = mail_dict['Mail']
    cur.execute("INSERT INTO contacts VALUES (?, ?, ?, ?)", (id, name, surname, mail))
    id = id + 1               

conn.commit() # Commit outstanding changes to disk 

----------------------------------

This is a simplified version of the program, without the DB, that only prints to the screen:

import io
fh = io.open("notes_contacts.csv", encoding="utf_16_le") #file handle

lst = list() #temporary list to hold content of the file
new_contact_list = list() #this list will contain the contact name,    surname and email address
count = 0 # to count number of contacts
id = 1 #will be used to add contacts id into the DB
for line in fh: #for every line in the file handle
    print "Line from file:\n", line # print it for debugging purposes
    new_contact = list()
    name = ''
    surname = ''
    mail = ''
    #split line into tokens at each '"' character and put tokens into  the temporary list
    lst = line.split('"')
    if lst[1] == ',': continue #if there is no first name, move to next line
    elif lst[1] != ',': #if 1st element of list is not empty
        name = lst[1] #this is the name
        print "Name in variable:", name # print it for debugging purposes
        if name[-1] == ',': #If last character in name is ','
            name = name[:-1] #delete it
            new_contact.append({'Name':name}) #add first name to new list of contacts
        if lst[5] != ',': #if there is a last name in the contact data
            surname = lst[5] #assign 5th element of the list to surname
            print "Surname in variable:", surname # print it for debugging purposes
            if surname[0] == ',': #If first character in surname is ','
                surname = surname[1:] #delete it
            if surname[-1] == ',': #If last character in surname is ','
                surname = surname[:-1] #delete it
            if ',' in surname: #if surname and mail are merged in same list element
                sur_mail = surname.split(',') #split them at the ','
                surname = sur_mail[0]
                mail = sur_mail[1]
            new_contact.append({'Surname':surname}) #add last name to new list of contacts
            new_contact.append({'Mail':mail}) #add mail address to new list of contacts
        new_contact_list.append(new_contact)
        print "New contact within the list:", new_contact # print it for debugging purposes

fh.close()

This is the content of the notes_contacts.csv file, which has only 1 line:

Aco,"",Vidovič,aco.vidovic@si.ibm.com,+38613208872,"",+38640456872,"","","","","","","","",""

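【Comments】:

  • Please show some code.
  • Are you using codecs to read the file?
  • Try opening the file with UTF-8 encoding: open(Filename, 'r', encoding='utf-8')
  • @Aco, please add the code to the question, not to the comments. You can edit comments for up to 5 minutes after posting.

Tags: python python-2.7 unicode python-unicode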


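【Solution 1】:

In Python 2.7, a file opened with the built-in open() is read as undecoded byte strings. Instead, open the file in text mode so that the text is decoded as it is read, the way Python 3 does it. You don't have to decode text while reading a file, but doing so saves you from worrying about encodings later in your code.

Add to the top: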

import io

Change:

 fh = io.open(fname, encoding='utf_16_le')

Note: you always need to pass in the encoding, because Python cannot reliably guess it on its own.

Now, on each read(), the text is converted to a Unicode string.
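The effect can be sketched as follows. This is only an illustration under assumptions: the temporary file and the sample text are made up here and stand in for the real CSV file.

```python
import io
import os
import tempfile

# "Vidovič", written with an escape so the example does not depend on
# the encoding of this source file
sample = u"Vidovi\u010d"

# Hypothetical scratch file standing in for the real CSV
path = os.path.join(tempfile.mkdtemp(), "sample_utf16.txt")
with io.open(path, "w", encoding="utf_16_le") as out:
    out.write(sample)

# Opening with the matching encoding yields decoded (Unicode) text
with io.open(path, encoding="utf_16_le") as fh:
    text = fh.read()

# Opening in binary mode returns the raw bytes stored on disk
with io.open(path, "rb") as fh:
    raw = fh.read()
```

Here text compares equal to the original Unicode string, while raw holds two bytes per character, which is what code working on undecoded data would see.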

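The sqlite3 module accepts TEXT values as either Unicode strings or UTF-8 encoded str. Since you have already decoded your text to Unicode, you don't have to do anything else.

To make sure that SQLite doesn't try to encode the body of your SQL command back to an ASCII string, turn the SQL command into a Unicode string by prefixing the string literal with u.

For example: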

cur.execute(u"INSERT INTO contacts VALUES (?, ?, ?, ?)", (id, name, surname, mail))
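A quick way to check that the text survives the round trip is to insert a row and read it back. The sketch below assumes an in-memory database rather than contacts.sqlite, and uses the name from the question's sample line.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory DB, used here only for illustration
cur = conn.cursor()
cur.execute(u"CREATE TABLE contacts (id INTEGER, name TEXT, surname TEXT, email TEXT)")

surname = u"Vidovi\u010d"  # "Vidovič" as a Unicode string
cur.execute(u"INSERT INTO contacts VALUES (?, ?, ?, ?)",
            (1, u"Aco", surname, u"aco.vidovic@si.ibm.com"))
conn.commit()

# Read the value back; sqlite3 returns TEXT columns as Unicode strings
cur.execute(u"SELECT surname FROM contacts WHERE id = ?", (1,))
stored = cur.fetchone()[0]
conn.close()
```

If stored equals surname, the accented character was kept intact by the database layer, so any garbling must happen before the insert or when the value is displayed.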

Python 3 helps you avoid some of these quirks; there, all you need to do to make it work is:

fh = open(fname, encoding='utf_16_le')  # the built-in open() accepts encoding in Python 3

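Since your data looks like standard Excel-dialect CSV, you can use the csv module to split it. DictReader lets you pass in the column names, which makes parsing your fields very easy. Unfortunately, the csv module in Python 2.7 is not Unicode-safe, so you need to use the Py3 backport: https://github.com/ryanhiebert/backports.csv

Your code can be simplified to: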

from backports import csv
import io

csv_fh = io.open('contacts.csv', encoding='utf_16_le')

field_names = [u'first_name', u'middle_name', u'surname', u'email',
               u'phone_office', u'fax', u'phone_mobile', u'inside_leg_measurement']

csv_reader = csv.DictReader(csv_fh, fieldnames=field_names)

for row in csv_reader:
    if not row['first_name']: continue

    print u"First Name: {first_name}, " \
          u"Surname: {surname} " \
          u"Email: {email}".format(first_name=row['first_name'],
                                   surname=row['surname'],
                                   email=row['email'])
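For comparison, here is a rough Python 3 sketch of the same idea using the standard library csv module, so no backport is needed. It is an assumption-laden illustration: io.StringIO stands in for the opened file, and the sample line is shortened from the one in the question to match the eight field names.

```python
import csv
import io

# Sample line adapted (shortened to 8 fields) from the question's CSV
data = u'Aco,"",Vidovi\u010d,aco.vidovic@si.ibm.com,+38613208872,"",+38640456872,""\n'

field_names = ['first_name', 'middle_name', 'surname', 'email',
               'phone_office', 'fax', 'phone_mobile', 'inside_leg_measurement']

# io.StringIO stands in for open(fname, encoding='utf_16_le', newline='')
reader = csv.DictReader(io.StringIO(data), fieldnames=field_names)

contacts = []
for row in reader:
    if not row['first_name']:
        continue  # skip rows without a first name
    contacts.append((row['first_name'], row['surname'], row['email']))
```

Each entry in contacts is a (first name, surname, email) tuple of Unicode strings, ready to be passed to cur.execute() as parameters.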

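【Comments】:

  • Thank you very much, @Alastair McCormack. I did what you suggested and the result is still the same (accented characters garbled), but it helped me pin down where the problem appears: when I use the list.append() method. That is, while a word with accented characters is in a variable, all the accented characters are still there (I can see them when I print the variable). But when I move that variable into a list with list.append(variable), the accented characters get scrambled. For example, the letter "č" becomes "\xc4\x8d". Do you know of a way to overcome this?
  • There's something you're not telling me :) '\xc4\x8d' is the UTF-8 encoding of 'č'. If your data is UTF-16, you won't see that unless you encode to UTF-8. I'm guessing this happens when you SELECT the data from SQLite?
  • I don't think it's either of those. I simplified the code by removing the database part, and now I only read from the file and print: first the line from the file is printed ('č' is visible), then the variable is printed ('č' is still visible), then the list is printed ('č' is garbled). There is no mention of UTF-8 in my code, only UTF_16_le. I did add this statement to the top of the file: coding = "UTF-16-le". Thanks to that, 'č' shows up in the line print and the variable print. I can show you the new simplified code (if I find a place to put it). Could it be that list.append() itself changes the encoding to UTF-8?
  • Appending a string to a list doesn't change its encoding. It sounds like your input data isn't "UTF-16-le", or it is being changed to "UTF-8" somehow. Please remove coding = "UTF-16-le", as this only refers to the encoding of the source code. Add a section at the bottom of your question with your new code.
  • I added the new simplified code to my question. Thank you very much for your time! I owe you a beer (or another drink of your choice). :) I'd really like to know "who" is changing the encoding.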
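
The point made in the comments above can be checked with a small sketch. One possible explanation for the symptom, not confirmed in the thread, is that printing a list in Python 2 shows the repr() of each element, so non-ASCII data appears as escape sequences even though the stored string is unchanged. The variable names below are made up for illustration.

```python
name = u"Vidovi\u010d"  # "Vidovič" as a Unicode string

contacts = []
contacts.append(name)

# append() stores a reference to the very same object; nothing is re-encoded
same_object = contacts[0] is name

# The escape sequences only describe bytes produced by an explicit encoding
utf8_bytes = name.encode("utf-8")        # ends with the bytes C4 8D
utf16_bytes = name.encode("utf_16_le")   # ends with the bytes 0D 01
```

Printing contacts[0] rather than the whole list displays the element itself instead of its repr().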
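【Solution 2】:

Try putting # coding=utf-8 on the first line of your program.

【Comments】:

  • That won't help. The Python coding directive comment tells the interpreter which encoding the script file itself is written in. It has nothing to do with how the script handles external text.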