【Question title】: Getting an empty list when scraping web page content using XPath
【Posted】: 2021-12-22 09:05:50
【Question description】:
When I try to retrieve some data from the URL in the following code using XPath, I get an empty list:
from lxml import html
import requests

if __name__ == '__main__':
    url = 'https://www.leagueofgraphs.com/champions/stats/aatrox'
    page = requests.get(url)
    tree = html.fromstring(page.content)
    # XPath to get the XP
    print(tree.xpath('//*[@id="graphDD1"]/text()'))

>>> []
What I expect is a string value like this:
>>> ['\r\n 5.0% ']
【Discussion】:
Tags:
python
python-3.x
web-scraping
request
lxml
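【Solution 1】:
This is because the element your XPath is looking for is produced by some JavaScript.
You need to find the cookie that is generated once the JavaScript has been called, so that you can make the same request to the URL yourself.
- Go to the "Network" tab of the browser's developer console
- Compare the request headers before and after abg_lite.js runs and look for the difference (mine was cookie: __cf_bm=TtnYbPlIA0J_GOhNj2muKa1pi8pU38iqA3Yglaua7q8-1636535361-0-AQcpStbhEdH3oPnKSuPIRLHVBXaqVwo+zf6d3YI/rhmk/RvN5B7OaIcfwtvVyR0IolwcoCk4ClrSvbBP4DVJ70I=)
- Add that cookie to your request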
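One way to check this diagnosis before hunting for cookies is to look at what the downloaded HTML actually contains for that id. The following is only a minimal sketch using the standard library's html.parser as a stand-in for lxml; the names IdTextCollector and direct_text_of are hypothetical helpers, not part of any library, and the sample HTML strings are made up for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IdTextCollector(HTMLParser):
    """Collects text that sits directly inside the element with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0        # > 0 while we are inside the target element
        self.found = False    # True once the target element has been seen
        self.chunks = []      # direct text children, similar to XPath text()

    def handle_starttag(self, tag, attrs):
        matches = dict(attrs).get("id") == self.element_id
        if tag in VOID_TAGS:
            # Void elements cannot contain text; only record that they exist.
            if not self.depth and matches:
                self.found = True
            return
        if self.depth:
            self.depth += 1
        elif matches:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 1:
            self.chunks.append(data)


def direct_text_of(html_text, element_id):
    """Return (found, text_chunks) for the element with the given id."""
    parser = IdTextCollector(element_id)
    parser.feed(html_text)
    parser.close()
    return parser.found, parser.chunks


if __name__ == "__main__":
    # Made-up samples: one element with text, one left empty for a script.
    print(direct_text_of('<div id="graphDD1"> 5.0% </div>', "graphDD1"))
    print(direct_text_of('<div id="graphDD1"></div><script>fill()</script>',
                         "graphDD1"))
```

If calling direct_text_of(page.text, 'graphDD1') on the response from the question returns (True, []), that would suggest the element is present but its text is filled in later; (False, []) would suggest the element is missing from the response altogether.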
from lxml import html
import requests

if __name__ == '__main__':
    url = 'https://www.leagueofgraphs.com/champions/stats/aatrox'
    # Create a session to add cookies and headers to
    s = requests.Session()
    # After finding the correct cookie, update your session's cookie jar
    # (add your own cookie here; the parentheses join the two string parts)
    s.cookies['cookie'] = (
        '__cf_bm=TtnYbPlIA0J_GOhNj2muKa1pi8pU38iqA3Yglaua7q8-1636535361-0-'
        'AQcpStbhEdH3oPnKSuPIRLHVBXaqVwo+zf6d3YI/rhmk/RvN5B7OaIcfwtvVyR0IolwcoCk4ClrSvbBP4DVJ70I='
    )
    # Update headers to spoof a regular browser; this may not be necessary
    # but is good practice to bypass any basic bot detection
    s.headers.update({
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
                      ' AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
    })
    page = s.get(url)
    tree = html.fromstring(page.content)
    # XPath to get the XP
    print(tree.xpath('//*[@id="graphDD1"]/text()'))
This produces the following output:
['\r\n 5.0% ']
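The code above stores the whole name=value string under a cookie that is itself named cookie. As a possible variation, the header line copied from the developer console can be split into separate name/value pairs first. The sketch below uses only the standard library; parse_cookie_header is a hypothetical helper, and the cookie value in the example is a placeholder, not a working token.

```python
from typing import Dict


def parse_cookie_header(header: str) -> Dict[str, str]:
    """Split a copied 'cookie: a=1; b=2' header line into name/value pairs."""
    header = header.strip()
    # Drop the leading 'cookie:' label if the whole header line was copied.
    if header.lower().startswith("cookie:"):
        header = header[len("cookie:"):]
    cookies = {}
    for part in header.split(";"):
        # Split on the first '=' only, since values may contain '=' themselves.
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


if __name__ == "__main__":
    # Placeholder value for illustration; copy your own from the browser.
    copied = "cookie: __cf_bm=abc-1636535361-0-xyz/Q=; theme=dark"
    print(parse_cookie_header(copied))
    # prints {'__cf_bm': 'abc-1636535361-0-xyz/Q=', 'theme': 'dark'}
```

The resulting dict could then be loaded into the session with s.cookies.update(parse_cookie_header(copied)), so that each cookie is sent under its own name. Whether the site accepts a cookie replayed this way, and for how long, is not something this sketch can guarantee.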