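【Posted】: 2021-06-22 16:27:05
【Problem description】:
I'm really new to XPath and web scraping, so I apologize if this is a relatively minor question. I'm trying to scrape multiple websites to make sure the data in our database is up to date. I'm able to get the XPath for a partial string, but I'm not sure how to use that XPath to get the full value.
Code: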
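import itertools
import re

import requests
from bs4 import BeautifulSoup


def xpath_soup(element):
    """Build an absolute XPath for a bs4 tag (or for the parent tag of a text node)."""
    components = []
    # A text node (NavigableString) has no tag name, so start from its parent tag.
    child = element if element.name else element.parent
    for parent in child.parents:
        # Nodes that come before `child` under the same parent.
        previous = itertools.islice(parent.children, 0, parent.contents.index(child))
        xpath_tag = child.name
        # XPath positions are 1-based and count only same-named siblings.
        xpath_index = sum(1 for i in previous if i.name == xpath_tag) + 1
        components.append(xpath_tag if xpath_index == 1 else '%s[%d]' % (xpath_tag, xpath_index))
        child = parent
    components.reverse()
    return '/%s' % '/'.join(components)


page = requests.get("https://www.gaumard.com/obstetricmr")
html = str(BeautifulSoup(page.content, 'html.parser'))
soup = BeautifulSoup(html, 'lxml')
# Find the text node that contains this fragment of the sentence.
elem = soup.find(string=re.compile('xt-generation mixed reality training solution for VICTORIA® S2200 designed to help learners bridge the gap between theory and practice'))
xPathValue = xpath_soup(elem)
print(xPathValue)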
I am trying to use xPathValue to get the full value of the element.
The expected result would be the full version of
xt-generation mixed reality training solution for VICTORIA® S2200 designed to help learners bridge the gap between theory and practice
that is:
Obstetric MR™ is a next-generation mixed reality training solution for VICTORIA® S2200 designed to help learners bridge the gap between theory and practice faster than ever before. Using the latest technology in holographic visualization, Obstetric MR brings digital learning content into the physical simulation exercise, allowing participants to link knowledge and skill through an entirely new hands-on training experience. The future of labor and delivery simulation is here.
This full value should be obtained by using xPathValue.
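One possible way to go from the generated XPath back to the full text is to evaluate it with lxml. The following is only a minimal sketch, not a tested solution for the real page: `SAMPLE_HTML` is a hypothetical stand-in for the downloaded page (its markup is an assumption), and the names `xpath_value`, `tree`, and `full_text` are illustrative. It also assumes that BeautifulSoup's `lxml` parser and `lxml.etree.HTML` build the same tree for the document, which may not hold for badly broken markup.

```python
import itertools
import re

from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """<html><body>
<div><p>Intro</p>
<p>Obstetric MR™ is a next-generation mixed reality training solution for VICTORIA® S2200 designed to help learners bridge the gap between theory and practice faster than ever before.</p>
</div>
</body></html>"""


def xpath_soup(element):
    """Same helper as in the question: absolute XPath for a bs4 node."""
    components = []
    child = element if element.name else element.parent
    for parent in child.parents:
        previous = itertools.islice(parent.children, 0, parent.contents.index(child))
        xpath_tag = child.name
        xpath_index = sum(1 for i in previous if i.name == xpath_tag) + 1
        components.append(xpath_tag if xpath_index == 1 else '%s[%d]' % (xpath_tag, xpath_index))
        child = parent
    components.reverse()
    return '/%s' % '/'.join(components)


# Step 1: locate the text node by a fragment and build its XPath with bs4.
soup = BeautifulSoup(SAMPLE_HTML, 'lxml')
elem = soup.find(string=re.compile('xt-generation mixed reality training solution'))
xpath_value = xpath_soup(elem)

# Step 2: evaluate that XPath with lxml and read all text inside the element.
tree = etree.HTML(SAMPLE_HTML)
matches = tree.xpath(xpath_value)
full_text = ''.join(matches[0].itertext()).strip()

print(xpath_value)
print(full_text)
```

If the XPath matches nothing, `matches` is an empty list, so real code would need to check for that before indexing.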
【Discussion】:
-
What should the expected result be?
-
@AndrejKesely The post has been edited.
-
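So, do you want to use XPath with beautifulsoup? bs4 has its own API, or you can use CSS selectors.
-
@AndrejKesely It really doesn't matter to me what I use. I've just been trying something different from the other approaches I've seen on stackoverflow.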
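For reference, the two alternatives mentioned in the comments (bs4's own API and CSS selectors) might look roughly like this. It is only a sketch under assumptions: `SAMPLE_HTML`, the `description` class, and the nested `<strong>` tag are all hypothetical and are not taken from the real page. The point is that `find(string=...)` returns just the matching text node, while the parent tag's `get_text()` returns everything inside the element.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup; the class name and the nested <strong> are assumptions.
SAMPLE_HTML = """
<div class="description">
  <p><strong>Obstetric MR™</strong> is a next-generation mixed reality training solution for VICTORIA® S2200 designed to help learners bridge the gap between theory and practice faster than ever before.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')


def normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return ' '.join(text.split())


# Option 1, bs4 API: find the text node by a fragment, then read its parent tag.
text_node = soup.find(string=re.compile('xt-generation mixed reality training solution'))
full_from_api = normalize(text_node.parent.get_text())

# Option 2, CSS selector: select the paragraph directly (selector is hypothetical).
paragraph = soup.select_one('div.description > p')
full_from_css = normalize(paragraph.get_text())

print(full_from_api)
print(full_from_css)
```

With this approach no XPath is needed at all; whether a stable CSS selector exists depends on the actual page structure.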
Tags: python html web-scraping xpath beautifulsoup