【Question title】: Python script using lxml: XPath returns an empty list
【Posted】: 2019-07-08 01:43:12
【Question】:

I am trying to scrape href links from HTML markup using XPath with lxml. In my script the XPath returns an empty list, even though the same expression seems to work when I test it on its own.

page = self.opener.open(link).read()
doc=html.fromstring(str(page))
ref = doc.xpath('//ul[@class="s-result-list s-col-1 s-col-ws-1 s-result-list-hgrid s-height-equalized s-list-view s-text-condensed s-item-container-height-auto"]/li/div/div[@class="a-fixed-left-grid"]/div/div[@class="a-fixed-left-grid-col a-col-left"]/div/div/a')
for post in ref:
    print(post.get("href"))

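I access the link through a proxy server, and that part seems to work, because the "doc" variable is populated with HTML content. I have checked the link, and I am on the right page for this XPath.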
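One detail in the snippet above may be worth checking separately. Assuming the script runs on Python 3 (the question does not say), `read()` on a urllib opener returns `bytes`, and `str(page)` then produces the printable representation of that bytes object rather than decoded HTML. The sketch below uses a made-up `page` value to show the difference; `lxml.html.fromstring` accepts bytes directly, so passing `page` without `str()` avoids the problem.

```python
# Assumption: Python 3, where the body read from a urllib opener is bytes.
# `page` is a made-up stand-in for self.opener.open(link).read().
page = b'<ul class="s-result-list">\n<li><a href="/dp/123">Book</a></li>\n</ul>'

as_repr = str(page)             # repr of the bytes object: starts with b' and escapes newlines
as_text = page.decode("utf-8")  # real text (assumes the page is UTF-8 encoded)

print(as_repr[:12])
print(as_text[:12])
```

This alone may not explain the empty list, but it means the parser sees a leading `b'` and literal backslash sequences instead of the original markup.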

This is the link I am trying to get data from: https://www.amazon.com/s/ref=lp_266162_nr_n_0?fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&bbn=266162&ie=UTF8&qid=1550120216&rnid=266162

【Comments】:

    Tags: python xpath web-scraping lxml


    【Solution 1】:

    I assume you are after the links listed under Books : Arts & Photography : Architecture : Buildings : Landmarks & Monuments. The script below uses XPath to collect them. Give it a try:

    import requests
    from lxml.html import fromstring
    
    link = 'https://www.amazon.com/s/ref=lp_266162_nr_n_0?fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&bbn=266162&ie=UTF8&qid=1550120216&rnid=266162'
    # Send a browser-like User-Agent header with the request
    r = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
    htmlcontent = fromstring(r.text)
    # Match any element under #mainResults whose class attribute contains "s-access-detail-page"
    itemlinks = htmlcontent.xpath('//*[@id="mainResults"]//*[contains(@class,"s-access-detail-page")]')
    # Loop variable named `item` so it does not overwrite the `link` URL above
    for item in itemlinks:
        print(item.get('href'))
    

    If you would rather use a CSS selector, the following should work:

    '#mainResults .s-access-detail-page'
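Both selectors in this answer test for a single class name, whereas the XPath in the question compares the entire class attribute with `@class="..."`, which only matches if the string is identical character for character. The sketch below illustrates the difference on a small, made-up snippet. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml so that it runs without extra packages; the markup and the `has_class` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup; the real Amazon page is far larger.
SNIPPET = """
<div id="mainResults">
  <ul class="s-result-list s-col-1 extra-class-added-later">
    <li><a class="a-link-normal s-access-detail-page" href="/dp/111">One</a></li>
    <li><a class="s-access-detail-page a-text-normal" href="/dp/222">Two</a></li>
    <li><a class="a-link-normal" href="/dp/333">Other</a></li>
  </ul>
</div>
"""
root = ET.fromstring(SNIPPET)

# Exact match: the predicate compares the whole attribute string, so one
# extra (or reordered) class name is enough to make it find nothing.
exact = root.findall(".//ul[@class='s-result-list s-col-1']")


def has_class(element, token):
    """Return True if `token` is one of the space-separated class names."""
    return token in element.get("class", "").split()


# Token match: look for one class name, whatever else is in the attribute.
by_token = [a.get("href") for a in root.iter("a")
            if has_class(a, "s-access-detail-page")]

print(len(exact))
print(by_token)
```

Matching on one stable class name tends to survive small changes in the page's markup better than matching a long, exact class string.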
    

    【Comments】:

      【Solution 2】:

      Your XPath selector does not work. Try the CSS selector below:

      import requests
      import lxml.html
      
      url = 'https://www.amazon.com/s/ref=lp_266162_nr_n_0?fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&bbn=266162&ie=UTF8&qid=1550120216&rnid=266162'
      r = requests.get(url)
      # Parse the raw response bytes into an HTML element tree
      html = lxml.html.fromstring(r.content)
      # cssselect() relies on the separate cssselect package
      links = html.cssselect('.a-fixed-left-grid-col .a-col-left a')
      for link in links:
          print(link.attrib['href'])
      

      Output

      https://www.amazon.com/Top-500-Instant-Pot-Recipes/dp/1730885209
      https://www.amazon.com/Monthly-Budget-Planner-Organizer-Notebook/dp/1978202865
      https://www.amazon.com/Edge-Order-Daniel-Libeskind/dp/045149735X
      https://www.amazon.com/Man-Glass-House-Johnson-Architect/dp/0316126438
      https://www.amazon.com/Versailles-Private-Invitation-Guillaume-Picon/dp/2080203371
      https://www.amazon.com/Palm-Springs-Modernist-Tim-Street-Porter/dp/0847861872
      https://www.amazon.com/Building-Chicago-Architectural-John-Zukowsky/dp/0847848701
      https://www.amazon.com/Taverns-American-Revolution-Adrian-Covert/dp/160887785X
      https://www.amazon.com/TRAVEL-MOSAIC-Color-Number-Relaxation/dp/1717562221
      https://www.amazon.com/Understanding-Cemetery-Symbols-Historic-Graveyards/dp/1547047216
      https://www.amazon.com/Soviet-Bus-Stops-Christopher-Herwig/dp/099319110X
      https://www.amazon.com/Famous-Movie-Scenes-Dot-Dot/dp/1977747043
      

      pip requirements

      certifi==2018.11.29
      chardet==3.0.4
      cssselect==1.0.3
      idna==2.8
      lxml==4.3.1
      requests==2.21.0
      urllib3==1.24.1
      

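      【Comments】:

      • But when I test the XPath selector with JS in the browser console, it returns results; please see the attached screenshot.
      • @AjayVictor I tried the selector with JS as well, and it does not work properly for me either. Try refreshing the page and testing again.
      • It still returns results for me. I am looking for an XPath solution; if I don't get one, I will accept your answer. Thanks for your effort.
      • Try opening the link above, inspect the element, right-click it, choose to copy its selector as an XPath, and try that. I could not get it to work, possibly because of dynamic JavaScript.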
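
The comments suggest that the DOM seen in the browser console may differ from the HTML the script actually downloads. One way to check is to list which of the class names a selector relies on are present in the raw response. The sketch below does this with the standard library's `html.parser`; `ClassCollector`, `missing_classes`, and the sample `fetched` string are hypothetical, with `fetched` standing in for something like `r.text`.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every class name that appears in any start tag."""

    def __init__(self):
        super().__init__()
        self.tokens = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.tokens.update(value.split())


def missing_classes(raw_html, expected):
    """Return the expected class names that never occur in raw_html."""
    collector = ClassCollector()
    collector.feed(raw_html)
    return [token for token in expected if token not in collector.tokens]


# Made-up stand-in for the downloaded page source.
fetched = ('<div id="mainResults"><a class="a-link-normal '
           's-access-detail-page" href="/dp/111">One</a></div>')
# Class names taken from the selectors discussed in this thread.
expected = ["s-result-list", "a-fixed-left-grid", "s-access-detail-page"]

missing = missing_classes(fetched, expected)
print(missing)
```

If class names used by the XPath show up as missing, the server probably returned different markup than the browser rendered, and the selector needs to be written against the downloaded source instead.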