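【Posted】: 2025-11-28 18:40:01
【Question】:
Apologies if the title is misleading.
I am trying to work out the language of a given song by querying a lyrics site and then checking the language of the lyrics with CLD2. For some songs, however (such as the example below), the foreign-language characters are not encoded correctly, and CLD2 throws this error: input contains invalid UTF-8 around byte 2121 (of 32761)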
import requests
import re
from bs4 import BeautifulSoup
import cld2


def checklang(lyrics):
    try:
        # cld2.detect is where the "invalid UTF-8" error is raised
        isReliable, textBytesFound, details = cld2.detect(lyrics)
        language = re.search("ENGLISH", str(details))
        if language is None:
            print("foreign lang")
        if len(re.findall("Unknown", str(details))) < 2:
            print("foreign lang")
        if language is not None:
            print("english")
    except Exception as exc:
        # The post omits the except clause; printing the error is an assumption
        print(exc)


response = requests.get("https://www.azlyrics.com/lyrics/blackpink/ddududdudu.html")
soup = BeautifulSoup(response.text, 'html.parser')
counter = 0
for item in soup.select("div"):
    counter += 1
    # The 21st <div> is taken to be the lyrics container
    if counter == 21:
        lyrics = item.get_text()
        checklang(lyrics)
        print("Lyrics found!")
        break
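It is also worth mentioning that this is not limited to non-Latin characters; it sometimes happens with apostrophes and other punctuation as well.
Can anyone explain why this happens, or what I can do to fix it?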
【Discussion】:
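One possible cause, offered as a sketch rather than a confirmed diagnosis: when a server's Content-Type header names a text type without a charset, requests assumes ISO-8859-1 for response.text. If the page is really UTF-8, every multi-byte character (Hangul, curly apostrophes) is split into Latin-1 characters, some of which are C1 control characters. That CLD2 rejects those characters is an assumption that fits the error message, and the headers this site actually sends have not been checked here. The helpers decode_body and has_control_chars are hypothetical names used only for illustration, and no network access is needed to run the block.

```python
import unicodedata
from typing import Optional


def decode_body(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode a response body, trying UTF-8 before the header-derived guess."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to the declared encoding (or ISO-8859-1) without raising
        return raw.decode(declared or "iso-8859-1", errors="replace")


def has_control_chars(text: str) -> bool:
    """Return True if text has control characters other than tab/newline."""
    return any(
        unicodedata.category(ch) == "Cc" and ch not in "\n\r\t"
        for ch in text
    )


# Simulated page bytes: Korean text plus a curly apostrophe, encoded as UTF-8
raw = "뚜두뚜두 don’t".encode("utf-8")

# What a str looks like when those UTF-8 bytes are read as ISO-8859-1
misdecoded = raw.decode("iso-8859-1")

# Decoding the raw bytes as UTF-8 recovers the original text
fixed = decode_body(raw)

print(has_control_chars(misdecoded))  # True
print(has_control_chars(fixed))       # False
```

With requests, the equivalent would be to pass decode_body(response.content, response.encoding) to BeautifulSoup instead of response.text, or to set response.encoding = "utf-8" before reading response.text. Both assume the page really is UTF-8.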
Tags: python http encoding utf-8 python-requests