UTF-8 Encoding with lxml.html

【Question title】: UTF-8 Encoding with lxml.html
【Posted】: 2015-10-26 05:20:04
【Question】:

How do I get the correct encoding for the Japanese synopsis on Google Play? Here is what I have so far:

import requests
from lxml import html
res=requests.get('https://play.google.com/store/tv/show?id=bgJpf84fT4Q')
node=html.fromstring(res.content)
print node.xpath('//div[@itemprop="description"]')[0].text

How do I set UTF-8 encoding on the text attribute?

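【Comments】:

If you are already using requests, why not use BeautifulSoup?
@Kupiakos I just find it a bit easier to parse with XPath through lxml. This is the first time I have run into an encoding problem with non-Latin characters.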

Tags: python unicode lxml

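【Solution 1】:

First, use res.text, not res.content. The former is unicode that requests has already decoded; the latter is a str of raw bytes that has not been decoded yet.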

node=html.fromstring(res.text)
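
To illustrate why the decoded text matters, here is a minimal sketch written for Python 3 that needs no network access. The `raw` bytes are a made-up stand-in for res.content, and Latin-1 is only an assumed example of a wrong guess; how a parser actually guesses the encoding of undeclared bytes may differ.

```python
# Hypothetical stand-in for res.content: raw UTF-8 bytes of a Japanese string.
raw = "日本語の概要".encode("utf-8")

# Decoding with the wrong codec raises no error; it silently yields mojibake.
# Latin-1 is an assumption here, chosen only to show the effect.
garbled = raw.decode("latin-1")

# Decoding with the right codec recovers the text. This is the role res.text
# plays: it hands the parser characters instead of undecoded bytes.
decoded = raw.decode("utf-8")

print(len(raw), len(decoded))
print(garbled != decoded)
```

The byte string is three times as long as the decoded text because each of these characters takes three bytes in UTF-8.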

Second, there is no <div itemprop="description"> on that page. The only itemprop="description" I can find is on a <meta>, not a <div>, as the following shows:

print [n.tag for n in node.xpath('//*[@itemprop="description"]')]
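
If the description does live in a <meta> element, its value would sit in the content attribute rather than in .text. The sketch below, written for Python 3, runs against a made-up snippet instead of the real page; the markup and the `page` variable are assumptions for illustration, not what Google Play actually returns.

```python
from lxml import html

# Hypothetical stand-in for res.text; the real page markup may differ.
page = """<html><head>
<meta itemprop="description" content="日本語の概要">
</head><body><div class="other">x</div></body></html>"""

node = html.fromstring(page)

# Match any element carrying the attribute, not just <div>.
matches = node.xpath('//*[@itemprop="description"]')
tags = [n.tag for n in matches]

# A <meta> element has no text content; read its content attribute instead.
description = matches[0].get("content")

print(tags, description)
```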

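【Comments】:

Thanks. Here is what I see:
itemprop="description"> Arman has given up all his pleasures and vices: smoking, drinking, and fast food, and he has gone through an initial six-month period of extreme martial arts training. Now Arman travels to 10 exotic locations around the world, including Japan, China, the United States, Cambodia, and Malaysia, each the birthplace of a different martial art! He must learn vital fighting techniques, extreme discipline, and extreme pain! ……s?
Also, what is the difference between res.text and res.url?
I see <div class="show-more-content text-body" itemprop="description"> in Chrome's developer console, but not when using requests. I wonder whether the div is generated by JavaScript, or whether the server returns different data depending on the user-agent.
Got it, thanks for the clarification in the updated answer.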