【Question title】: Python: Getting Element Value from XPath
【Posted】: 2021-06-22 16:27:05
【Question】:

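I'm really new to XPath and web scraping, so I apologize if this is a fairly basic question. I'm trying to scrape multiple websites to make sure the data in my database stays up to date. I can get the XPath for a partial string, but I'm not sure how to use that XPath to get the full value.

Code: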

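import itertools
import re

import requests
from bs4 import BeautifulSoup


def xpath_soup(element):
    """Build an absolute XPath for a BeautifulSoup element or text node."""
    components = []
    # A text node has no tag name, so start from its parent tag.
    child = element if element.name else element.parent
    for parent in child.parents:
        # Count preceding siblings with the same tag to get the 1-based position.
        previous = itertools.islice(parent.children, 0, parent.contents.index(child))
        xpath_tag = child.name
        xpath_index = sum(1 for i in previous if i.name == xpath_tag) + 1
        components.append(xpath_tag if xpath_index == 1 else '%s[%d]' % (xpath_tag, xpath_index))
        child = parent
    components.reverse()
    return '/%s' % '/'.join(components)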



page = requests.get("https://www.gaumard.com/obstetricmr")
html = str(BeautifulSoup(page.content, 'html.parser'))
soup = BeautifulSoup(html, 'lxml')
elem = soup.find(string=re.compile('xt-generation mixed reality training solution for VICTORIA® S2200 designed to help learners bridge the gap between theory and practice'))
xPathValue = xpath_soup(elem)
print(xPathValue)

I'm trying to use xPathValue to get the element's full value.

The expected result is the full version of the text that contains "xt-generation mixed reality training solution for VICTORIA® S2200 designed to help learners bridge the gap between theory and practice"

that is:

Obstetric MR™ is a next-generation mixed reality training solution for VICTORIA® S2200 designed to help learners bridge the gap between theory and practice faster than ever before. Using the latest technology in holographic visualization, Obstetric MR brings digital learning content into the physical simulation exercise, allowing participants to link knowledge and skill through an entirely new hands-on training experience. The future of labor and delivery simulation is here.

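This full value should be obtained by using xPathValue.

【Comments】:

  • What should the expected result be?
  • @AndrejKesely The post has been edited.
  • So, you want to use XPath together with BeautifulSoup? bs4 has its own API, or you can use CSS selectors.
  • @AndrejKesely It really doesn't matter to me what I use. I've been trying different things that I've seen on Stack Overflow.

Tags: python html web-scraping xpath beautifulsoup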


【Answer 1】:

Here is how to get the full text using XPath.

import requests
from lxml import html

page = requests.get("https://www.gaumard.com/obstetricmr").text
text = html.fromstring(page).xpath('//*[@style="margin: 0 auto;"][2]/div/text()')
print(text[0].strip())

Output:

Obstetric MR™ is a next-generation mixed reality training solution for VICTORIA® S2200 designed to help learners bridge the gap between theory and practice faster than ever before. Using the latest technology in holographic visualization, Obstetric MR brings digital learning content into the physical simulation exercise, allowing participants to link knowledge and skill through an entirely new hands-on training experience. The future of labor and delivery simulation is here.

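【Comments】:

  • Thanks for the reply. It works, but I'm trying to create the XPath automatically so that I can use this code for other websites as well. Is there a way to generate //*[@style="margin: 0 auto;"][2]/div/text() automatically?
  • Websites differ, so I don't think there is a way to automatically create a valid XPath.
  • My approach here is to find where part of the string "Obstetric MR™ is a next-generation mixed reality training" sits in the HTML and derive the XPath from that. Is that not a good approach, or is there a better one?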
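
Regarding the approach in the last comment: once the node containing the known fragment has been found, its full text can be read straight from that node, without going through an XPath string at all. Below is a minimal sketch of that idea, not the asker's or answerer's code. SAMPLE_HTML is a made-up stand-in for the downloaded page and full_text_for_fragment is a hypothetical helper name.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (assumption, not the real site).
SAMPLE_HTML = (
    "<html><body>"
    "<div><p>Unrelated text.</p></div>"
    "<div><p>Obstetric MR is a next-generation mixed reality training "
    "solution, faster than ever before.</p></div>"
    "</body></html>"
)


def full_text_for_fragment(page_source, fragment):
    """Return the full text of the first tag whose text contains `fragment`, or None."""
    soup = BeautifulSoup(page_source, "html.parser")
    # re.escape keeps characters such as '.' or '(' in the fragment literal.
    node = soup.find(string=re.compile(re.escape(fragment)))
    if node is None:
        return None
    # `node` is only the matching text node; its parent tag holds the whole text.
    return node.parent.get_text(strip=True)


print(full_text_for_fragment(SAMPLE_HTML, "xt-generation mixed reality"))
```

If the XPath itself is still needed (for example, to store it in the database), it can be generated from the same node afterwards; the lookup of the text does not depend on it.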
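【Answer 2】:

A page-specific XPath won't help much because, as already mentioned, web pages differ. A generic XPath that searches the text nodes and returns an array or list of the nodes containing the string can help, combined with some post-processing.

Try it in the Firefox console: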

nodes = $x('//*[contains(text(),"next-generation mixed reality")]', window.document, "nodes");
<- Array [ div ]

nodes[0].textContent;
<- "Obstetric MR™ is a next-generation...(redacted)"

This XPath will work on other pages
'//*[contains(text(),"next-generation mixed reality")]'
provided they contain the string "next-generation mixed reality".

The same thing in Python:

import requests
from lxml import html
url = 'https://www.gaumard.com/obstetricmr'
response = requests.get(url)
html_doc = response.content
xpath0 = '//*[contains(text(),"next-generation mixed reality")]'
result_arr = html.fromstring(html_doc).xpath(xpath0)
result_arr[0].text

Output:

'Obstetric MR™ is a next-generation mixed...'
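
Two details may be worth sketching on top of the Python example above. In XPath 1.0, contains(text(), ...) only tests the element's first text node, so an element with inline markup before the fragment can be missed; and lxml can report the absolute XPath of a matched element through getpath(), which is close to what the question asks for. The following is a hedged sketch under those assumptions: SAMPLE_HTML is invented for illustration and find_by_fragment is a hypothetical helper name.

```python
from lxml import html

# Hypothetical page: the fragment sits in the SECOND text node of <p>.
SAMPLE_HTML = (
    "<html><body>"
    "<div><p>Intro</p></div>"
    "<div><p>Gaumard <b>Obstetric MR</b> is a next-generation "
    "mixed reality solution.</p></div>"
    "</body></html>"
)


def find_by_fragment(page_source, fragment):
    """Return (xpath, full_text) pairs for elements with a text node containing `fragment`."""
    root = html.fromstring(page_source)
    tree = root.getroottree()
    # text()[contains(., $frag)] checks every text-node child, not only the first.
    # $frag is an XPath variable, so quotes in the fragment need no escaping.
    matches = root.xpath("//*[text()[contains(., $frag)]]", frag=fragment)
    # getpath() gives the absolute XPath; text_content() includes child tags' text.
    return [(tree.getpath(el), el.text_content().strip()) for el in matches]


for path, text in find_by_fragment(SAMPLE_HTML, "next-generation mixed reality"):
    print(path, "->", text)
```

The path returned by getpath() is only valid for the tree lxml built from that particular page, so it is better treated as a record of where the text was found than as a selector that carries over to other sites.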

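【Comments】:

  • Hi Luis, I'm a bit confused. I'm looking for Python code; what is the code above?
  • It can be done in any language that supports XPath, so the key part is the XPath itself. Give me some time and I'll add a Python example (later today).
  • @Bob Added the Python example.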