Problems with encoding while parsing an HTML document with lxml: answers

[Question title]: Problems with encoding while parsing html document with lxml
[Posted]: 2015-06-23 06:13:06
[Question]:

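I am trying to get clean text from some web pages. I have read a lot of tutorials and ended up with the python lxml + beautifulsoup + requests modules. The reason for using lxml for this task is that it cleans HTML files better than Beautiful Soup does.

I ended up with a test script like this: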

from bs4 import UnicodeDammit                                                    
import re                                                                        
import requests                                                                  
import lxml                                                                      
import lxml.html                                                                 
from time import sleep                                                           

urls = [                                                                         
    "http://mathprofi.ru/zadachi_po_kombinatorike_primery_reshenij.html",        
    "http://ru.onlinemschool.com/math/assistance/statistician/",                 
    "http://mathprofi.ru/zadachi_po_kombinatorike_primery_reshenij.html",        
    "http://universarium.org/courses/info/332",                                  
    "http://compsciclub.ru/course/wordscombinatorics",                           
    "http://ru.onlinemschool.com/math/assistance/statistician/",                 
    "http://lectoriy.mipt.ru/course/Maths-Combinatorics-AMR-Lects/",             
    "http://www.youtube.com/watch?v=SLPrGWQBX0I"                                 
]                                                                                


def check(url):                                                                     
    print "That is url {}".format(url)                                           
    r = requests.get(url)                                                        
    ud = UnicodeDammit(r.content, is_html=True)                                  
    content = ud.unicode_markup.encode(ud.original_encoding, "ignore")           
    root = lxml.html.fromstring(content)                                            
    lxml.html.etree.strip_elements(root, lxml.etree.Comment,                        
                                   "script", "style")                               
    text = lxml.html.tostring(root, method="text", encoding=unicode)                
    text = re.sub('\s+', ' ', text)                                                 
    print "Text type is {}!".format(type(text))                                     
    print text[:200]                                                                
    sleep(1)                                                                        


if __name__ == '__main__':                                                          
    for url in urls:                                                                
        check(url)

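The intermediate decoding and re-encoding to the original encoding is needed because an HTML page may contain a few characters that are encoded differently from most of the others. Such cases later break the lxml tostring method.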
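
The decode-and-re-encode step described above can be sketched in isolation. This is a minimal Python 3 illustration, not the script itself (which is Python 2); the helper name drop_undecodable and the sample bytes are made up for the example.

```python
def drop_undecodable(raw: bytes, encoding: str) -> bytes:
    """Decode with errors ignored, then re-encode.

    Bytes that are invalid in `encoding` are silently dropped, so the
    result is guaranteed to decode cleanly afterwards.
    """
    return raw.decode(encoding, "ignore").encode(encoding)


# Hypothetical page body: valid UTF-8 with one stray byte (0xFF is never valid UTF-8).
good = "Тест".encode("utf-8")
raw = good + b"\xff" + b" ok"

# Strict decoding fails on the stray byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

cleaned = drop_undecodable(raw, "utf-8")
print(strict_ok)
print(cleaned.decode("utf-8"))
```

The trade-off is that the offending bytes are lost rather than repaired, which is acceptable only when they are rare.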

However, my code does not work properly in all tests. Sometimes (especially with the last two URLs) it outputs a mess:

...
That is url http://ru.onlinemschool.com/math/assistance/statistician/
Text type is <type 'unicode'>!
 Онлайн решение задач по математике. Комбинаторика. Теория вероятности. Close Авторизация на сайте Введите логин: Введите пароль: Запомнить меня Регистрация Изучение математики онлайн.Изучайте математ
That is url http://lectoriy.mipt.ru/course/Maths-Combinatorics-AMR-Lects/
Text type is <type 'unicode'>!
 ÐÐ°ÑÐµÐ¼Ð°ÑÐ¸ÐºÐ°. ÐÑÐ½Ð¾Ð²Ñ ÐºÐ¾Ð¼Ð±Ð¸Ð½Ð°ÑÐ¾ÑÐ¸ÐºÐ¸ Ð¸ ÑÐµÐ¾ÑÐ¸Ð¸ ÑÐ¸ÑÐµÐ» / ÐÐ¸Ð´ÐµÐ¾Ð»ÐµÐºÑÐ¸Ð¸ Ð¤Ð¸Ð·ÑÐµÑÐ°: ÐÐµÐºÑÐ¾ÑÐ¸Ð¹ ÐÐ¤Ð¢Ð - Ð²Ð¸Ð´ÐµÐ¾Ð»ÐµÐºÑÐ¸Ð¸ Ð¿Ð¾ ÑÐ¸Ð·Ð¸ÐºÐµ,
That is url http://www.youtube.com/watch?v=SLPrGWQBX0I
Text type is <type 'unicode'>!
 ÐÑÐ½Ð¾Ð²Ð½ÑÐµ ÑÐ¾ÑÐ¼ÑÐ»Ñ ÐºÐ¾Ð¼Ð±Ð¸Ð½Ð°ÑÐ¾ÑÐ¸ÐºÐ¸ - bezbotvy - YouTube ÐÑÐ¾Ð¿ÑÑÑÐ¸ÑÑ RU ÐÐ¾Ð±Ð°Ð²Ð¸ÑÑ Ð²Ð¸Ð´ÐµÐ¾ÐÐ¾Ð¹ÑÐ¸ÐÐ¾Ð¸ÑÐº ÐÐ°Ð³ÑÑÐ·ÐºÐ°... ÐÑÐ±ÐµÑÐ¸ÑÐµ ÑÐ·ÑÐº.

This mess is somehow connected with the ISO-8859-1 encoding, but I cannot figure out how. For each of the last two URLs I get:

In [319]: r = requests.get(urls[-1])

In [320]: chardet.detect(r.content)
Out[320]: {'confidence': 0.99, 'encoding': 'utf-8'}

In [321]: UnicodeDammit(r.content, is_html=True).original_encoding
Out[321]: 'utf-8'

In [322]: r = requests.get(urls[-2])

In [323]: chardet.detect(r.content)
Out[323]: {'confidence': 0.99, 'encoding': 'utf-8'}

In [324]: UnicodeDammit(r.content, is_html=True).original_encoding
Out[324]: u'utf-8'

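So I guess lxml does its internal decoding based on wrong assumptions about the input string. I think it does not even try to guess the input string's encoding. It seems that something like this happens in the core of lxml: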

In [339]: print unicode_string.encode('utf-8').decode("ISO-8859-1", "ignore")
ÑÑÑÐ¾ÐºÐ°
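
The same effect can be reproduced and reversed without lxml. The sketch below is Python 3 (the session above is Python 2), and the sample word is an assumption chosen for illustration: UTF-8 bytes read as ISO-8859-1 turn every Cyrillic letter into two Latin-1 characters, and because ISO-8859-1 maps every byte to a character, the damage is reversible.

```python
original = "строка"  # sample text, assumed for illustration

# Each Cyrillic letter becomes two bytes in UTF-8.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as ISO-8859-1 never fails, but yields two
# unrelated characters per letter: this is the "mess".
garbled = utf8_bytes.decode("iso-8859-1")

# Reversing the wrong step recovers the text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(len(original), len(garbled))
print(repaired == original)
```

If already-garbled text has the doubled length and starts with characters such as Ñ or Ð, a UTF-8 page decoded as ISO-8859-1 is a likely cause.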

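How can I resolve my issue and strip the HTML tags from the pages at all of these URLs? Maybe I should use other Python modules or do it another way? Please give me your suggestions.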

[Comments]:

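I don't know what UnicodeDammit is, but... r.content should be unicode and should let you work with the right content. You could retrieve r.text instead. It will surely look messy, but from there you can try encoding, as you are now in ascii
Related: stackoverflow.com/questions/2307795/…, stackoverflow.com/questions/1495627/…
UnicodeDammit allows guessing the charset of a web page based on HTTP headers, meta tags and content. r.content is not unicode; it is a plain sequence of bytes from the server. r.text converts r.content to unicode assuming it is in r.encoding. But r.encoding is not always right, and I cannot use the r.text approach because I get an error from lxml: UnicodeDecodeError: 'utf8' codec can't decode byte 0x.. in position ..: invalid continuation byte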
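
One way r.encoding can be wrong: the requests documentation describes a fallback to ISO-8859-1 when a text response carries no charset in its Content-Type header. The sketch below only simulates that rule with the standard library (Python 3); encoding_from_content_type is a hypothetical helper, not part of requests, and the sample body is made up.

```python
from email.message import Message


def encoding_from_content_type(content_type, text_fallback="ISO-8859-1"):
    """Pick an encoding from a Content-Type header value.

    Hypothetical helper: returns the declared charset if present,
    otherwise `text_fallback` for text/* types, otherwise None.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased charset or None
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        return text_fallback
    return None


body = "Комбинаторика".encode("utf-8")  # sample UTF-8 page body

# With a declared charset the body decodes correctly.
with_charset = body.decode(encoding_from_content_type("text/html; charset=UTF-8"))

# Without one, the fallback silently produces garbled text.
without_charset = body.decode(encoding_from_content_type("text/html"))

print(with_charset)
print(without_charset)
```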

Tags: python html unicode lxml scrape

[Solution 1]:

I finally figured it out. The solution is not to use

root = lxml.html.fromstring(content)

but to configure an explicit Parser object that can be told to use a particular encoding enc:

htmlparser = etree.HTMLParser(encoding=enc)
root = etree.HTML(content, parser=htmlparser)
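
A self-contained Python 3 version of the same idea, as a sketch: the sample page and the fixed enc value are assumptions, and encoding="unicode" is used where Python 2 code would pass the unicode type.

```python
from lxml import etree

# Hypothetical sample page: UTF-8 bytes with no charset declaration.
content = (
    "<html><body><!-- note --><p>Комбинаторика</p>"
    "<script>var x = 1;</script></body></html>"
).encode("utf-8")

enc = "utf-8"  # assumed to be known already, e.g. from detection

# Tell the parser which encoding the bytes are in instead of letting it guess.
parser = etree.HTMLParser(encoding=enc)
root = etree.HTML(content, parser=parser)

# Drop comments, scripts and styles; with_tail=False keeps the text after them.
etree.strip_elements(root, etree.Comment, "script", "style", with_tail=False)

# Serialize only the text nodes, as a str rather than bytes.
text = etree.tostring(root, method="text", encoding="unicode")
print(text)
```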

Moreover, I found that even UnicodeDammit makes obvious mistakes when deciding on a page's encoding. So I added another if block:

if (declared_enc and enc != declared_enc):

Here is the resulting snippet:

from lxml import etree, html
import requests
from bs4 import UnicodeDammit
import chardet 


try:
    self.log.debug("Try to get content from page {}".format(url))
    r = requests.get(url)
except requests.exceptions.RequestException as e:
    self.log.warn("Unable to get page content of the url: {url}. "
                  "The reason: {exc!r}".format(url=url, exc=e))
    raise ParsingError(e.message)

ud = UnicodeDammit(r.content, is_html=True)

enc = ud.original_encoding.lower()
declared_enc = ud.declared_html_encoding
if declared_enc:
    declared_enc = declared_enc.lower()
# possible misrecognition of an encoding
if (declared_enc and enc != declared_enc):
    detect_dict = chardet.detect(r.content)
    det_conf = detect_dict["confidence"]
    # chardet may report None when it cannot guess at all
    det_enc = (detect_dict["encoding"] or "").lower()
    if enc == det_enc and det_conf < THRESHOLD_OF_CHARDETECT:
        enc = declared_enc
# if the page contains any characters that differ from the main
# encoding, we will ignore them
content = r.content.decode(enc, "ignore").encode(enc)
htmlparser = etree.HTMLParser(encoding=enc)
root = etree.HTML(content, parser=htmlparser)
etree.strip_elements(root, html.etree.Comment, "script", "style")
text = html.tostring(root, method="text", encoding=unicode)
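
The encoding decision in the snippet above can be pulled out into a pure function, which makes it easy to test without any network access. This is a Python 3 sketch: choose_encoding and its parameter names are hypothetical, and since the value of THRESHOLD_OF_CHARDETECT is not given, the threshold is a required argument (the 0.9 below is an arbitrary example).

```python
def choose_encoding(detected, declared, chardet_encoding, chardet_confidence, threshold):
    """Decide which encoding to trust for a page.

    detected:           what UnicodeDammit guessed (original_encoding)
    declared:           what the page declares (declared_html_encoding), or None
    chardet_encoding:   chardet's guess, or None
    chardet_confidence: chardet's confidence in that guess
    threshold:          below this confidence, the declared encoding wins
    """
    enc = detected.lower()
    declared = declared.lower() if declared else None
    chardet_encoding = (chardet_encoding or "").lower()

    # Only second-guess the detection when the page itself disagrees with it.
    if declared and enc != declared:
        # chardet agrees with the detection, but only weakly: trust the page.
        if enc == chardet_encoding and chardet_confidence < threshold:
            enc = declared
    return enc


# Detection and declaration disagree, and chardet is unsure.
weak = choose_encoding("windows-1252", "UTF-8", "windows-1252", 0.5, threshold=0.9)
# Same disagreement, but chardet is confident, so the detection stands.
strong = choose_encoding("windows-1252", "UTF-8", "windows-1252", 0.99, threshold=0.9)
print(weak, strong)
```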

[Comments]: