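[Posted]: 2016-09-16 15:50:18
[Question]:
I am using requests to fetch a page. The task is simple, but I have an encoding problem. The page contains non-ASCII Turkish characters, but in the HTML source they appear as follows: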
ÇINARTEPE # What it looks like
&#199;INARTEPE # What it is like in HTML source
So the operations below do not return the results I expect:
# What I have tried as encoding
req.encoding = "utf-8"
req.encoding = "iso-8859-9"
req.encoding = "iso-8859-1"
# The operations
"ÇINARTEPE" in req.text # False, it must return True
bytes("ÇINARTEPE", "utf-8") in req.content # False
bytes("ÇINARTEPE", "iso-8859-9") in req.content # False
bytes("ÇINARTEPE", "iso-8859-1") in req.content # False
I just want to find out whether the string "ÇINARTEPE" is in the HTML source.
More information
An example:
req = requests.get("http://www.eshot.gov.tr/tr/OtobusumNerede/290")
"ÇINARTEPE" in req.text # False
req.encoding = "iso-8859-1"
"ÇINARTEPE" in req.text # False
req.encoding = "iso-8859-9"
"ÇINARTEPE" in req.text # False
# Supposed to return True
Environment
- Python 3.5.1
- requests 2.10.0
[Comments]:
-
How are you handling it? Show us some code!
-
Updated the question.
-
Isn't it just html.unescape("&#199;INARTEPE")? ^checks^ Yes, I think that's it.
-
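@TadhgMcDonald-Jensen, I'm waiting for you to write an answer so I can mark it as accepted.
-
JEan PAul beat me to it; I'd rather miss out on some rep than post a duplicate answer.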
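The `html.unescape` suggestion from the comments can be sketched roughly as follows. This is a minimal illustration, not a confirmed fix for the page in question: `page_source` is a hypothetical stand-in for `req.text`, and it assumes the page stores "Ç" as the numeric character reference `&#199;`. The helper name `contains_text` is also made up for this example. Because a character reference is plain ASCII, changing `req.encoding` would not be expected to affect it, which would explain why none of the encodings tried above made a difference.

```python
import html

# Hypothetical stand-in for req.text. Assumption: the page encodes "Ç"
# as the numeric character reference &#199; rather than as a raw character.
page_source = "<td>&#199;INARTEPE</td>"


def contains_text(source: str, needle: str) -> bool:
    """Return True if needle occurs in source once HTML character
    references (such as &#199; or &amp;) have been decoded."""
    return needle in html.unescape(source)


# Searching the raw source misses the string, because the source holds
# the reference, not the character itself.
print("ÇINARTEPE" in page_source)

# Decoding the references first lets the plain string comparison work.
print(contains_text(page_source, "ÇINARTEPE"))
```

With a real response, the same check would presumably be `"ÇINARTEPE" in html.unescape(req.text)`. Unescaping the whole document also decodes references such as `&lt;`, so this is suited to a simple substring test rather than to further HTML parsing.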
Tags: python python-3.x web-scraping python-requests