Beautiful Soup find using a Persian string

【Question title】: Beautiful Soup find using Persian string
【Posted】: 2016-08-07 11:48:56
【Question】:

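I want to use Beautiful Soup in Python to find all elements that contain a given string.

It works when I use non-Persian characters, but it does not work when I use Persian characters.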

import urllib2
from bs4 import BeautifulSoup
QUERY = 'رشته فارسی'
URL = 'http://www.example.com'
headers = {
  'User-Agent': "Mozilla/5.0 . . . "
}
request = urllib2.Request(URL, headers=headers)
response = urllib2.urlopen(request)
response_content = response.read().decode('utf8')
soup = BeautifulSoup(response_content, 'html.parser')
fetched = soup.find_all(text=QUERY)
print(fetched)

For the code above the output is [], but if I use ASCII in the query it works fine.

Is there a workaround, such as a UTF-8 conversion? :)

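【Comments】:

You need to match the encoding used on the page.
@PadraicCunningham How do I do that?
Shouldn't you use utf-8 or UTF-8 instead of utf8?
You are using Python 3, right? I don't know Persian, but have you tried normalizing QUERY and response_content? Just because two strings look the same, they don't have to be made up of the same code points (and they don't have to look the same on a computer either).
What do you see when you print(repr(QUERY))?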
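
The normalization and repr suggestions above can be sketched as follows. This is only an illustration, not a confirmed diagnosis of the question: it assumes the mismatch comes from Arabic letters being typed in place of their Persian look-alikes, which `unicodedata.normalize` alone does not fold together. The `ARABIC_TO_PERSIAN` table and both helper functions are hypothetical names introduced here.

```python
# -*- coding: utf-8 -*-
import unicodedata

# Hypothetical mapping for illustration: Arabic letters that are often
# typed in place of their Persian look-alikes.
ARABIC_TO_PERSIAN = {
    0x064A: 0x06CC,  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    0x0643: 0x06A9,  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}


def normalize_persian(text):
    """Apply NFC, then fold Arabic yeh/kaf into the Persian forms."""
    return unicodedata.normalize('NFC', text).translate(ARABIC_TO_PERSIAN)


def code_points(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return ['U+%04X' % ord(ch) for ch in text]


persian = u'\u0641\u0627\u0631\u0633\u06CC'  # the word "farsi" with FARSI YEH
arabic = u'\u0641\u0627\u0631\u0633\u064A'   # same word typed with ARABIC YEH

print(persian == arabic)  # False
print(code_points(persian)[-1], code_points(arabic)[-1])
print(normalize_persian(persian) == normalize_persian(arabic))  # True
```

Applying the same helper to both the query and the page text before comparing would rule this class of mismatch in or out.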

Tags: python web-scraping beautifulsoup persian

【Solution 1】:

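    # -*- coding: utf-8 -*-
    # The coding line above lets this Python 2 source file contain Persian literals.
    import urllib2
    from bs4 import BeautifulSoup

    QUERY = 'خدمات'
    URL = 'https://bayan.ir/service/bayan/'
    headers = {
        'User-Agent': "Mozilla/5.0 . . . "
    }
    request = urllib2.Request(URL, headers=headers)
    response = urllib2.urlopen(request)
    # Pass the raw bytes and let Beautiful Soup work out the encoding.
    response_content = response.read()
    soup = BeautifulSoup(response_content, 'html.parser')
    # string= (called text= in older bs4 releases) matches text nodes equal to QUERY.
    fetched = soup.find_all(string=QUERY)
    print(fetched)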

It works!
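
One difference from the question's code is that the response bytes go to BeautifulSoup without a manual `.decode('utf8')`. The sketch below, written for Python 3, shows that Beautiful Soup accepts bytes and reports the encoding it settled on through `original_encoding`. The page content is made up for illustration and stands in for what `response.read()` would return.

```python
from bs4 import BeautifulSoup

# Made-up page for illustration, encoded to bytes the way
# response.read() would return it.
PAGE = u'<html><head><meta charset="utf-8"></head><body><p>خدمات</p></body></html>'
raw_bytes = PAGE.encode('utf-8')

soup = BeautifulSoup(raw_bytes, 'html.parser')

# original_encoding reports the encoding Beautiful Soup settled on.
print(soup.original_encoding)

# Text nodes are Unicode strings after parsing, so a Unicode query matches.
found = soup.find_all(string=u'خدمات')
print(found)
```

If the reported encoding is not the one the page really uses, the parsed text will not match the query even when both look right on screen.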

【Comments】:

You have to pass the exact string for BeautifulSoup to check.
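
To illustrate the point about exact strings: a plain string passed to `find_all(string=...)` has to equal the whole text node, while a compiled regular expression or a function can match more loosely. The following is a minimal Python 3 sketch with made-up HTML, not the page from the question.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the second text node has an extra word
# after the query.
HTML = u'<ul><li>خدمات</li><li>خدمات بیان</li><li>درباره ما</li></ul>'
QUERY = u'خدمات'

soup = BeautifulSoup(HTML, 'html.parser')

# A plain string must equal the whole text node.
exact = soup.find_all(string=QUERY)

# A compiled pattern is applied with search(), so it matches substrings.
partial = soup.find_all(string=re.compile(re.escape(QUERY)))

# A function gives full control, e.g. ignoring surrounding whitespace.
stripped = soup.find_all(string=lambda s: s is not None and s.strip() == QUERY)

print(len(exact), len(partial), len(stripped))
```

If the query is only part of a longer text node, the regular-expression form is usually the one to reach for.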