Getting an empty list when scraping web page content using XPath

【Question title】: Getting empty list when scraping web page content using xpath
【Posted】: 2021-12-22 09:05:50
【Question description】:

When I try to retrieve some data from the URL in the following code using XPath, I get an empty list:

from lxml import html
import requests

if __name__ == '__main__':
    url = 'https://www.leagueofgraphs.com/champions/stats/aatrox'

    page = requests.get(url)
    tree = html.fromstring(page.content)

    # XPath to get the XP
    print(tree.xpath('//*[@id="graphDD1"]/text()'))

>>> []

What I expect is a string value like this:

>>> ['
        5.0%    ']

【Question discussion】:

Tags: python python-3.x web-scraping request lxml

【Solution 1】:

This is because the element your XPath is searching for depends on some JavaScript, which requests does not execute.
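Before looking for cookies, it can help to confirm that the element is really missing from the HTML that requests receives. The following is a minimal sketch using only the standard library; the class IdFinder and the function has_element_with_id are hypothetical names introduced here for illustration, and the check only reports whether a tag carrying the id is present, not whether it has any text.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Record whether any start tag carries the given id attribute."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if dict(attrs).get("id") == self.element_id:
            self.found = True


def has_element_with_id(html_text, element_id):
    """Return True if the markup contains a tag with id == element_id."""
    finder = IdFinder(element_id)
    finder.feed(html_text)
    finder.close()
    return finder.found
```

Calling has_element_with_id(page.text, 'graphDD1') on the response from the question would then show whether the tag is absent from the downloaded markup or present but empty.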

You need to find the cookie that is generated after the JavaScript is called, so that you can make the same call to the URL.

1. Go to the Network tab of the developer console.
2. Look for a difference in the request headers after abg_lite.js has run (mine was cookie: __cf_bm=TtnYbPlIA0J_GOhNj2muKa1pi8pU38iqA3Yglaua7q8-1636535361-0-AQcpStbhEdH3oPnKSuPIRLHVBXaqVwo+zf6d3YI/rhmk/RvN5B7OaIcfwtvVyR0IolwcoCk4ClrSvbBP4DVJ70I=).
3. Add the cookie to your request.

from lxml import html
import requests

if __name__ == '__main__':
    url = 'https://www.leagueofgraphs.com/champions/stats/aatrox'

    # Create a session to add cookies and headers to
    s = requests.Session()

    # After finding the correct cookie, add it to the session's cookie jar.
    # The cookie name is __cf_bm; add your own value here.
    s.cookies.set(
        '__cf_bm',
        'TtnYbPlIA0J_GOhNj2muKa1pi8pU38iqA3Yglaua7q8-1636535361-0-'
        'AQcpStbhEdH3oPnKSuPIRLHVBXaqVwo+zf6d3YI/rhmk/RvN5B7OaIcfwtvVyR0IolwcoCk4ClrSvbBP4DVJ70I='
    )

    # Update headers to spoof a regular browser; this may not be necessary
    # but is good practice to bypass any basic bot detection
    s.headers.update({
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
                      ' AppleWebKit/537.36 (KHTML, like Gecko)'
                      ' Chrome/89.0.4389.90 Safari/537.36'
    })

    page = s.get(url)
    tree = html.fromstring(page.content)

    # XPath to get the XP
    print(tree.xpath('//*[@id="graphDD1"]/text()'))

This produces the following output:

['\r\n 5.0% ']
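The cookie copied from the request headers in the developer console is a raw header string, while requests expects each cookie as a separate name and value. As a rough sketch of that conversion, the helper below splits such a string into a dict; parse_cookie_header is a hypothetical name introduced here, and it assumes simple name=value pairs separated by semicolons.

```python
def parse_cookie_header(header):
    """Split a raw Cookie header string into a {name: value} dict."""
    # Drop an optional leading "cookie:" label copied from the dev tools.
    if header.lower().startswith("cookie:"):
        header = header[len("cookie:"):]

    cookies = {}
    for pair in header.split(";"):
        # Split on the first "=" only, since values may contain "=" themselves.
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies
```

The resulting dict could then be passed through the cookies= argument of requests.get, which avoids hard-coding the cookie name in the script.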

【Discussion】: