Python: BeautifulSoup returns wrong decoding for tis-620, charset windows-874

【Question title】: Python: Beautifulsoup returns wrong decode for tis-620, charset windows-874
【Posted】: 2014-02-23 18:46:26
【Question】:

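I am trying to read a Thai website that uses the tis-620 encoding and convert the result to UTF-8 so that I can load it into any database.

While doing so I noticed the following BeautifulSoup behaviour.

Simple example

Source (the page declares http-equiv="Content-Type" content="text/html; charset=windows-874"):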

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
     <html xmlns="http://www.w3.org/1999/xhtml"><head>

    <meta http-equiv="Content-Type" content="text/html; charset=windows-874" />
    <meta name="description" content="Discussion Forum" />
    </head>
    <body>
        hello
        <dl>
            <dt>no  English Thai    abbrev. phonemic</dt>
            <dt>01  January มกราคม  ม.ค.    mohkH gaL raaM khohmM</dt>
            <dt>02  February    กุมภาพันธ์  ก.พ.    goomM phaaM phanM</dt>
            <dt>03  March   มีนาคม  มี.ค.   meeM naaM khohmM</dt>
            <dt>04  April   เมษายน  เม.ย.   maehM saaR yohnM</dt>
            <dt>05  May พฤษภาคม พ.ค.    phreutH saL phaaM khohmM</dt>
            <dt>06  June    มิถุนายน    มิ.ย.   miH thooL naaM yohnM</dt>
            <dt>07  July    กรกฎาคม ก.ค.    gaL raH gaL daaM khohmM</dt>
            <dt>08  August  สิงหาคม ส.ค.    singR haaR khohmM</dt>
            <dt>09  September   กันยายน ก.ย.    ganM yaaM yohnM</dt>
            <dt>10  October ตุลาคม  ต.ค.    dtooL laaM khohmM</dt>
            <dt>11  November    พฤศจิกายน   พ.ย.    phreutH saL jiL gaaM yohnM</dt>
            <dt>12  December    ธันวาคม ธ.ค.    thanM waaM khohmM</dt>
        </dl>
   </body>
   </html>

Python program:

1. Example with BeautifulSoup

import urllib2
from bs4 import BeautifulSoup
from datetime import date
import sys

sys.setdefaultencoding("tis-620")
reload(sys)
fthai = open('translation_thai.html','r')
soup = BeautifulSoup(fthai.read())
ln = soup.findAll('dl')
translation=[]
dttag = ln[0].findNext('dt')
for el in dttag.nextSiblingGenerator():
    if el <> u'\n':
        ss = el.get_text()
        vsep = el.get_text().encode('utf8').split('\t')
        line=[]
        for x in range(0,3):
            line.append(vsep[x])
        translation.append(line)
fthai.close()

ss returns u'03\tMarch\t\xc1\xd5\xb9....

vsep returns ['03','March','\xc3\x81\xc3......

which is wrong according to the UTF-8 table.
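The wrong bytes above are what you would expect if each windows-874 byte had been decoded as a Latin-1 style character before being encoded to UTF-8. The sketch below reproduces both results with the standard library only. It is written for Python 3, and the Latin-1 fallback is an assumption about what the parser did, not something confirmed in this thread.

```python
# Raw bytes for the start of the Thai word for "March" in a windows-874 file.
raw = b"\xc1\xd5\xb9"

# Assumption: the parser fell back to a Latin-1-like single-byte decoding,
# so every byte became the code point with the same numeric value.
misdecoded = raw.decode("latin-1")
wrong_utf8 = misdecoded.encode("utf-8")

# Decoding with Python's Thai code page ("cp874") gives the intended text.
thai = raw.decode("cp874")
right_utf8 = thai.encode("utf-8")

print(repr(misdecoded), wrong_utf8)
print(repr(thai), right_utf8)
```

The first pair matches the `\xc3\x81\xc3...` prefix seen in vsep, and the second pair starts with the `\xe0\xb8\xa1\xe0\xb8\xb5` bytes that the second example produces.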

2. Example without BeautifulSoup

import urllib2
import sys

sys.setdefaultencoding("tis-620")
reload(sys)
fthai = open('translation_thai.html','r')
array = []
x=-1
for line in fthai:
    x=x+1
    array.append( line )
    print array[x],x
sp = array[17].split('\t')
sp1 = sp[2].encode('utf8')

sp returns ['03','March','\xc1\xd5\xb9......

sp1 returns '\xe0\xb8\xa1\xe0\xb8\xb5

Correct!

According to the UTF-8 table:

3617 e21 ม E0B8A1 1110 0000 1011 1000 1010 0001 ม

3637 e35 ี E0B8B5 1110 0000 1011 1000 1011 0101 ี
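The two table rows can be checked directly. This is a small Python 3 sketch; `utf8_row` is a hypothetical helper name introduced here for illustration only.

```python
def utf8_row(code_point):
    """Return (hex code point, UTF-8 bytes as hex, UTF-8 bits) for one character."""
    encoded = chr(code_point).encode("utf-8")
    bits = " ".join(format(byte, "08b") for byte in encoded)
    return format(code_point, "x"), encoded.hex().upper(), bits

# Decimal code points taken from the two rows above.
for cp in (3617, 3637):
    print(cp, *utf8_row(cp))
```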

Can anyone tell me how to fix this wrong behaviour?

【Comments】:

What happens if you run BS4 without sys.setdefaultencoding()?
Same as above: ss returns u'03\tMarch\t\xc1\xd5\xb9.... and vsep returns ['03','March','\xc3\x81\xc3......

Tags: python utf-8 beautifulsoup decode encode

【Solution 1】:

My solution:

I now call convert instead of encode('utf-8'). It is a bit slow, but it works.

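def convert(content):
    # content: a unicode string whose non-ASCII characters are really
    # windows-874 bytes that were decoded one-to-one (e.g. u'\xc1\xd5').
    # Note: this presumably relies on the default encoding having been
    # set to "tis-620", as in the question (Python 2 only).
    result = ''
    for char in content:
        # A non-ASCII char becomes e.g. '\\xc1'; [2:] keeps the hex digits.
        asciichar = char.encode('ascii', errors="backslashreplace")[2:]
        if asciichar == '':
            # Plain ASCII character: encode it directly.
            utf8char = char.encode('utf8')
        else:
            try:
                # 'c1' -> raw byte '\xc1'; encode() on a byte string first
                # decodes it implicitly with the default encoding.
                hexchar = asciichar.decode('hex')
                utf8char = hexchar.encode('utf-8')
            except Exception:
                # Anything that cannot be converted becomes a space.
                utf8char = ' '

        result = result + utf8char
    return result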
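A shorter alternative is to undo the wrong decoding for the whole string at once instead of character by character. The following is a minimal Python 3 sketch, not the author's method. `repair_thai` is a hypothetical name, and the choice of "latin-1" as the wrong encoding is an assumption that fits the `\xc1\xd5\xb9` values shown in the question.

```python
def repair_thai(text, wrong="latin-1", right="cp874"):
    """Undo a wrong single-byte decoding and return the text as UTF-8 bytes.

    text  -- string that was decoded with the wrong single-byte encoding
    wrong -- encoding assumed to have been applied by mistake
    right -- encoding the bytes were really in (Python's name for windows-874)
    """
    original_bytes = text.encode(wrong)      # recover the raw file bytes
    return original_bytes.decode(right).encode("utf-8")

# Value of ss from the question, truncated to the first three Thai characters.
print(repair_thai("03\tMarch\t\xc1\xd5\xb9"))
```

It may also be possible to avoid the problem at the source by decoding the file bytes with "cp874" before parsing, or by passing BeautifulSoup's `from_encoding` argument with an encoding name that Python's codec registry recognizes. Neither was tested in this thread.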

【Comments】: