【Question title】: Getting empty list when scraping web page content using xpath
【Posted】: 2021-12-22 09:05:50
【Question】:

When I try to retrieve some data from the URL in the following code using XPath, I get an empty list:

from lxml import html
import requests

if __name__ == '__main__':
    url = 'https://www.leagueofgraphs.com/champions/stats/aatrox'

    page = requests.get(url)
    tree = html.fromstring(page.content)

    # XPath to get the XP
    print(tree.xpath('//*[@id="graphDD1"]/text()'))
>>> []

What I expect is a string value like this:

>>> ['
        5.0%    ']

【Question discussion】:

    Tags: python python-3.x web-scraping request lxml


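    【Answer 1】:

    This is because the element your XPath targets is inside some JavaScript, so it is not present in the HTML that a plain requests.get(url) receives.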
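
One way to check this before changing anything is to look at the raw response and see whether an element with the target id exists as real markup at all. The sketch below uses only the standard library; `has_element_with_id` is a hypothetical helper name, not part of requests or lxml, and the sample strings in the usage note are made up for illustration.

```python
from html.parser import HTMLParser


class _IdFinder(HTMLParser):
    """Records whether any start tag carries the wanted id attribute."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for this tag
        if dict(attrs).get("id") == self.target_id:
            self.found = True


def has_element_with_id(html_text, target_id):
    """Return True if html_text contains a real element with that id.

    Text inside <script> blocks is treated as script data, not markup,
    so an id that only appears inside JavaScript is not counted.
    """
    finder = _IdFinder(target_id)
    finder.feed(html_text)
    return finder.found
```

Calling `has_element_with_id(page.text, "graphDD1")` on the response from the question would then tell you whether the XPath has anything to match. If it returns False, the empty list is expected and the fix has to happen on the request side.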

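    You need to find the cookie that is generated after the JavaScript runs, so that you can make the same call to the URL yourself.

    1. Go to the Network tab of the browser's developer console
    2. Look for the difference in the request headers after abg_lite.js has run (mine was cookie: __cf_bm=TtnYbPlIA0J_GOhNj2muKa1pi8pU38iqA3Yglaua7q8-1636535361-0-AQcpStbhEdH3oPnKSuPIRLHVBXaqVwo+zf6d3YI/rhmk/RvN5B7OaIcfwtvVyR0IolwcoCk4ClrSvbBP4DVJ70I=)
    3. Add the cookie to your request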
    from lxml import html
    import requests

    if __name__ == '__main__':
        url = 'https://www.leagueofgraphs.com/champions/stats/aatrox'

        # Create a session to add cookies and headers to
        s = requests.Session()

        # After finding the correct cookie, update your session's cookie jar.
        # Add your own cookie here: the name is __cf_bm, the rest is its value.
        s.cookies.set(
            '__cf_bm',
            'TtnYbPlIA0J_GOhNj2muKa1pi8pU38iqA3Yglaua7q8-1636535361-0-'
            'AQcpStbhEdH3oPnKSuPIRLHVBXaqVwo+zf6d3YI/rhmk/RvN5B7OaIcfwtvVyR0IolwcoCk4ClrSvbBP4DVJ70I=',
        )

        # Update headers to spoof a regular browser; this may not be necessary
        # but is good practice to bypass any basic bot detection
        s.headers.update({
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
                          ' AppleWebKit/537.36 (KHTML, like Gecko)'
                          ' Chrome/89.0.4389.90 Safari/537.36'
        })

        page = s.get(url)
        tree = html.fromstring(page.content)

        # XPath to get the XP
        print(tree.xpath('//*[@id="graphDD1"]/text()'))
    

    This produces the following output:

    ['\r\n 5.0% ']
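
The cookie in step 2 is copied from the browser as a single header line, while a requests session wants separate names and values. A minimal sketch of splitting such a line is shown here; `parse_cookie_header` is a hypothetical helper, and it assumes the usual `name=value; name=value` layout with an optional leading `cookie:` label.

```python
def parse_cookie_header(header):
    """Split a copied Cookie header line into a {name: value} dict."""
    header = header.strip()
    # Drop the "cookie:" label if the whole header line was pasted
    if header.lower().startswith("cookie:"):
        header = header[len("cookie:"):]

    cookies = {}
    for part in header.split(";"):
        # Split on the first "=" only, since values may contain "=" themselves
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies
```

The resulting dict can be passed to `s.cookies.update(...)` on the session from the answer. Cookies of this kind tend to be short-lived, so expect to copy a fresh value from the browser when the request starts returning an empty list again.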

    【Discussion】:
