【Question title】: Get all values of specific key with xpath (python web scraping)
【Posted】: 2020-08-13 03:21:17
【Question description】:

Suppose we have a web page containing:

<div class="specific-row" data-id="101736782"></div>
<div class="yellow-box-row" data-id="112376244"></div>
<div class="specific-row" data-id="179218312"></div>
<div class="vip-row" data-id="123749014"></div>

How can I get all the data-id values? For example: ['101736782', '112376244', '179218312', '123749014']

I tried tree.xpath, but my expression does not work:

import requests
from lxml import html

r = requests.get(url)
tree = html.fromstring(r.content)

tree.xpath("//div@data-id=['any']")

【Comments on the question】:

  • An XPath 2.0 solution would be string-join(//@data-id,";") combined with the Saxon/C processor for Python. Output: 101736782;112376244;179218312;123749014
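
The comment above relies on XPath 2.0, while lxml (used in the question) implements XPath 1.0 only. A minimal sketch of a rough equivalent, assuming the markup from the question is parsed from a string rather than fetched with requests: selecting the attribute itself returns the values as strings, and the join is done in Python instead of with string-join().

```python
from lxml import html

# Sample markup copied from the question; a real page would come from
# requests.get(url).content instead of this literal string.
page = """
<div class="specific-row" data-id="101736782"></div>
<div class="yellow-box-row" data-id="112376244"></div>
<div class="specific-row" data-id="179218312"></div>
<div class="vip-row" data-id="123749014"></div>
"""

tree = html.fromstring(page)

# Selecting the attribute node itself makes lxml return string values
ids = tree.xpath("//div/@data-id")

# Joining in Python stands in for XPath 2.0's string-join()
joined = ";".join(ids)

print(ids)
print(joined)
```

Using //@data-id instead of //div/@data-id would collect the attribute from any element, not only from div elements.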

Tags: python html css xpath web-scraping


【Solution 1】:

I tried this...

from lxml import etree, html

doc = '<root><div class="specific-row" data-id="101736782"></div><div class="yellow-box-row" data-id="112376244"></div><div class="specific-row" data-id="179218312"></div><div class="vip-row" data-id="123749014"></div></root>'

# Parse the string as XML; for HTML input, html.fromstring(doc) can be used instead
root = etree.XML(doc)

xpatheval = etree.XPathEvaluator(root)

# Select only the divs that carry a data-id attribute
divs = xpatheval('//div[@data-id]')
ids = [el.get('data-id') for el in divs]

# If cssselect is installed, a CSS attribute selector works too

divs = root.cssselect('[data-id]')
ids = [el.get('data-id') for el in divs]

# cssselect uses the same selector syntax as document.querySelectorAll('[data-id]') in browsers

# Maybe this is what you are looking for -- https://lxml.de/tutorial.html#elementpath
# Note: this path matches only direct children of root
divs = root.findall('div[@data-id]')
ids = [el.get('data-id') for el in divs]

I used this link for reference.
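
One detail of the ElementPath variant may be worth spelling out: a path such as 'div[@data-id]' only looks at direct children, whereas './/div[@data-id]' searches all descendants. The sketch below illustrates the difference using the standard library's xml.etree.ElementTree, which accepts the same path syntax for these two expressions; the nested section element and the plain-row div are hypothetical additions for illustration, not part of the original page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one matching div sits one level deeper than the
# others, and one div has no data-id attribute at all.
doc = (
    '<root>'
    '<div class="specific-row" data-id="101736782"></div>'
    '<section><div class="vip-row" data-id="123749014"></div></section>'
    '<div class="plain-row"></div>'
    '</root>'
)
root = ET.fromstring(doc)

# 'div[@data-id]' only inspects direct children of root
direct = [el.get("data-id") for el in root.findall("div[@data-id]")]

# './/div[@data-id]' searches every descendant of root
nested = [el.get("data-id") for el in root.findall(".//div[@data-id]")]

print(direct)  # ['101736782']
print(nested)  # ['101736782', '123749014']
```

For a real page, where the target rows are usually nested inside other elements, the './/' form is likely the one that is needed.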

【Comments】:

  • Hi Daniel. Thanks for your answer, it does work. I tried using the attrib method and it also works: ids=[] for atag in tree.xpath("//div[@id='results']/div"): try: #print(atag.attrib['data-id']) ids.append(atag.attrib['data-id']) except KeyError: continue
【Solution 2】:

I tried using the attrib method and it worked:

ids=[]
for atag in tree.xpath("//div[@id='results']/div"):
  try:
    #print(atag.attrib['data-id'])
    ids.append(atag.attrib['data-id'])
  except KeyError:
    continue
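
The try/except above is there to skip rows that lack the attribute. As a possible alternative, the filtering can be moved into the XPath expression itself with a [@data-id] predicate. This is only a sketch: the sample page, including the id="results" wrapper (taken from the XPath above) and the ad-row div, is a hypothetical stand-in for the real site.

```python
from lxml import html

# Hypothetical page; the real one would be fetched with requests.
page = """
<html><body>
<div id="results">
  <div class="specific-row" data-id="101736782"></div>
  <div class="ad-row"></div>
  <div class="vip-row" data-id="123749014"></div>
</div>
</body></html>
"""
tree = html.fromstring(page)

# The [@data-id] predicate drops rows without the attribute,
# so no KeyError handling is needed afterwards.
rows = tree.xpath("//div[@id='results']/div[@data-id]")
ids = [row.get("data-id") for row in rows]

print(ids)  # ['101736782', '123749014']
```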

【Comments】:
