HTML from a webpage does not display foreign language characters correctly: answers

【Question title】: HTML from a webpage does not display foreign language characters correctly
【Posted】: 2025-11-28 18:40:01
【Question description】:

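Apologies if the title is misleading.

I am trying to work out the language of a given song by querying a lyrics site and then checking the language of the lyrics with CLD2. However, for some songs (such as the example given below) the foreign characters are not encoded correctly, which means CLD2 throws this error: input contains invalid UTF-8 around byte 2121 (of 32761)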

import requests
import re
from bs4 import BeautifulSoup
import cld2

# Defined before the scraping loop so it exists when the loop calls it.
def checklang(lyrics):
    try:
        isReliable, textBytesFound, details = cld2.detect(lyrics)
        language = re.search("ENGLISH", str(details))

        if language is None:
            print("foreign lang")

        if len(re.findall("Unknown", str(details))) < 2:
            print("foreign lang")

        if language is not None:
            print("english")
    except Exception as error:
        # The posted snippet is cut off after the try body;
        # this handler is added only so that the code runs.
        print(error)

response = requests.get("https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html")

soup = BeautifulSoup(response.text, 'html.parser')
counter = 0
for item in soup.select("div"):
    counter += 1
    if counter == 21:  # the 21st <div> is taken to hold the lyrics
        lyrics = item.get_text()
        checklang(lyrics)
        print("Lyrics found!")
        break

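It is also worth mentioning that this is not limited to non-Latin characters; it sometimes happens with apostrophes and other punctuation too.

Can anyone explain why this happens, or what I can do to fix it?

【Question discussion】:

Tags: python http encoding utf-8 python-requests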

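【Solution 1】:

Requests is supposed to make an educated guess about the encoding of the response based on the HTTP headers.

Unfortunately, in the given example response.encoding reports ISO-8859-1 even though response.content contains <meta charset="utf-8">. ISO-8859-1 is the fallback Requests uses when a text Content-Type header carries no explicit charset; it does not look at the HTML body when making this guess.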
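To see why a wrong encoding guess garbles both Korean text and ordinary punctuation such as a curly apostrophe, here is a minimal sketch that uses only the standard library. The sample string is made up for illustration and is not taken from the page in question.

```python
# Illustrative sample: Hangul plus a curly apostrophe, both multi-byte in UTF-8.
original = "뚜두뚜두 it’s on"
raw = original.encode("utf-8")          # the bytes a UTF-8 page would send

# Decoding those bytes with the wrong codec yields one character per byte.
garbled = raw.decode("iso-8859-1")
print(garbled == original)              # False
print(len(original), len(garbled))      # the garbled string is longer

# ISO-8859-1 maps every byte to exactly one character, so the damage is reversible.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == original)             # True
```

Any character outside ASCII takes more than one byte in UTF-8, which is why apostrophes and other typographic punctuation are affected along with non-Latin scripts.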

Here is my solution, based on the Response Content paragraph in the requests documentation.

import requests
import re
from bs4 import BeautifulSoup
# import cld2
import pycld2 as cld2

def checklang(lyrics):
    # Print every language entry that CLD2 reports for the lyrics.
    isReliable, textBytesFound, details = cld2.detect(lyrics)
    # language = re.search("ENGLISH", str(details))
    for detail in details:
        print(detail)

response = requests.get('https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html')

print(response.encoding)                            # the encoding guessed from the headers
response.encoding = 'utf-8'                         ### key change ###

# response.text is now decoded as UTF-8 instead of ISO-8859-1.
soup = BeautifulSoup(response.text, 'html.parser')
counter = 0
for item in soup.select("div"):
    counter += 1
    if counter == 21:
        lyrics = item.get_text()
        checklang(lyrics)
        print("Lyrics found!")
        break
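The fix above hard-codes 'utf-8', which suits this page but not necessarily others. As a rough sketch of one alternative, the helper below reads the charset that the page itself declares in its <meta> tag. The name declared_charset, the regular expression, and the 2048-byte window are assumptions made for this illustration; none of them are part of Requests.

```python
import re

# Matches <meta charset="..."> as well as the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(rb'<meta[^>]*?charset=\s*["\']?\s*([\w.:-]+)', re.IGNORECASE)

def declared_charset(body: bytes, default: str = "utf-8") -> str:
    """Return the charset declared near the top of an HTML page, or `default`."""
    match = META_CHARSET.search(body[:2048])  # only scan the start of the document
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()

# Hypothetical usage with a requests response:
#     response.encoding = declared_charset(response.content)
print(declared_charset(b'<html><head><meta charset="utf-8"></head>'))
```

Requests also exposes response.apparent_encoding, which guesses the encoding from the bytes of the body, and BeautifulSoup can be given response.content (bytes) so that it does its own detection. Either may be worth trying before writing a custom helper.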

Output: \SO\65630066.py

ISO-8859-1
('ENGLISH', 'en', 74, 833.0)
('Korean', 'ko', 20, 3575.0)
('Unknown', 'un', 0, 0.0)
Lyrics found!
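To get from this output back to the original goal of naming the song's language, one possibility is to pick the entry with the largest percentage rather than searching the string form of details with regular expressions. This is only a sketch: it assumes each entry is laid out as (name, code, percent, score), as the output above suggests, and dominant_language is a made-up helper name.

```python
def dominant_language(details):
    """Return the entry covering the largest share of the text, ignoring 'Unknown'."""
    known = [entry for entry in details if entry[1] != "un"]
    if not known:
        return None
    return max(known, key=lambda entry: entry[2])  # entry[2] is the percentage

# The entries printed in the output above.
details = (
    ("ENGLISH", "en", 74, 833.0),
    ("Korean", "ko", 20, 3575.0),
    ("Unknown", "un", 0, 0.0),
)

name, code, percent, score = dominant_language(details)
print(name, code, percent)  # ENGLISH en 74
```

For a song that mixes languages, as this one does, it may be more useful to keep every entry above some percentage threshold than to pick only the top one.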

【Discussion】: