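【Question title】: HTML from a webpage does not display foreign language characters correctly
【Posted】: 2025-11-28 18:40:01
【Question description】:

Apologies if the title is misleading.

I am trying to find out the language of a given song by querying a lyrics site and then checking the language of the lyrics with CLD2. However, for some songs (such as the example given below), the foreign-language characters are not encoded correctly, which means CLD2 throws this error: input contains invalid UTF-8 around byte 2121 (of 32761)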

import re

import requests
from bs4 import BeautifulSoup
import cld2


def checklang(lyrics):
    try:
        isReliable, textBytesFound, details = cld2.detect(lyrics)
        language = re.search("ENGLISH", str(details))

        if language is None:
            print("foreign lang")

        if len(re.findall("Unknown", str(details))) < 2:
            print("foreign lang")

        if language is not None:
            print("english")
    except Exception as exc:  # assumed handler: the posted snippet ends before its except clause
        print(exc)


response = requests.get("https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html")

# response.text is decoded using response.encoding, which Requests guesses from the HTTP headers
soup = BeautifulSoup(response.text, "html.parser")
counter = 0
for item in soup.select("div"):
    counter += 1
    if counter == 21:  # the 21st <div> is assumed to hold the lyrics on this page
        lyrics = item.get_text()
        checklang(lyrics)
        print("Lyrics found!")
        break

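It is also worth mentioning that this is not limited to non-Latin characters; it sometimes happens with apostrophes or other punctuation as well.

Can anyone explain why this happens or what I can do to fix it?

【Question discussion】:

    Tags: python http encoding utf-8 python-requests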


    【Solution 1】:

    Requests is supposed to make an educated guess about the encoding of the response based on the HTTP headers.

    Unfortunately, in the given example, response.encoding reports ISO-8859-1 even though response.content contains <meta charset="utf-8">.
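
The effect of that mismatch can be reproduced without any network access. The sketch below is only an illustration: the sample string is made up, and it assumes the page bytes are UTF-8 while the decoder uses ISO-8859-1, which is what Requests falls back to for text responses whose Content-Type header carries no charset. It also shows why typographic apostrophes are affected: they are multi-byte in UTF-8 as well.

```python
# Illustrative sample (not taken from the page): Korean text plus a
# typographic apostrophe (U+2019), both multi-byte in UTF-8.
original = "뚜두뚜두 don’t"

# What the server sends: the UTF-8 bytes of the text.
raw = original.encode("utf-8")

# Decoding with the wrong codec never raises, because every byte value is a
# valid ISO-8859-1 character, but each byte becomes its own wrong character.
garbled = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 restores the text.
restored = raw.decode("utf-8")

print(repr(garbled))
print(restored)
```

Since the wrong decoding is lossless here, `garbled.encode("iso-8859-1").decode("utf-8")` also recovers the original, but telling Requests the right encoding before reading `response.text` is the cleaner fix.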

    Here is my solution, based on the Response Content paragraph in the Requests documentation.

    import requests
    from bs4 import BeautifulSoup
    import pycld2 as cld2  # the question uses `import cld2`


    def checklang(lyrics):
        isReliable, textBytesFound, details = cld2.detect(lyrics)
        # Print one tuple per detected language
        for detail in details:
            print(detail)


    response = requests.get('https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html')

    print(response.encoding)                            # encoding guessed from the HTTP headers
    response.encoding = 'utf-8'                         ### key change ###

    # response.text is now decoded as UTF-8
    soup = BeautifulSoup(response.text, 'html.parser')
    counter = 0
    for item in soup.select("div"):
        counter += 1
        if counter == 21:
            lyrics = item.get_text()
            checklang(lyrics)
            print("Lyrics found!")
            break
    

    Output of \SO\65630066.py:

    ISO-8859-1
    ('ENGLISH', 'en', 74, 833.0)
    ('Korean', 'ko', 20, 3575.0)
    ('Unknown', 'un', 0, 0.0)
    Lyrics found!
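
Hard-coding 'utf-8' works for this page but would be wrong for a site that really serves another encoding. A more general approach, sketched below under stated assumptions, is to trust the charset only when it is explicitly declared and otherwise look for a <meta charset> declaration in the raw bytes. `decode_html` and `META_CHARSET` are hypothetical names, not part of Requests or BeautifulSoup, and the regular expression is a rough approximation of how HTML declares its encoding, not a full parser.

```python
import re
from typing import Optional

# Hypothetical pattern: matches <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([\w-]+)", re.IGNORECASE)


def decode_html(raw: bytes, header_charset: Optional[str] = None) -> str:
    """Decode HTML bytes using, in order of preference: an explicitly
    declared header charset, a <meta> charset declaration, then UTF-8."""
    encoding = header_charset
    if encoding is None:
        # The declaration is expected near the top of the document.
        match = META_CHARSET.search(raw[:2048])
        encoding = match.group(1).decode("ascii") if match else "utf-8"
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or bytes that do not fit it: fall back to
        # UTF-8 and mark undecodable bytes instead of raising.
        return raw.decode("utf-8", errors="replace")


# Made-up page standing in for response.content.
sample = '<html><head><meta charset="utf-8"></head><body>뚜두뚜두</body></html>'.encode("utf-8")
print(decode_html(sample))
```

With Requests, `raw` would be `response.content`. BeautifulSoup can also be given `response.content` (bytes) instead of `response.text`, in which case it performs its own encoding detection.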
    

    【Discussion】: