Get raw text from lxml element? Answers

【Question title】: Get raw text from lxml element?
【Posted】: 2020-12-27 15:09:15
【Question】:

I have some working Python code.

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("some_url")
raw_educations = driver.find_elements_by_xpath("//div[@id='education']/div/div/div")
educations = raw_educations[0].text.split("\n")

raw_educations[0] is this element:

<div class="ds dt" id="u_0_2"><div class="cm du"><a class="co" href="/Harvard/"><img src="https://scontent.fhel6-1.fna.fbcdn.net/v/t34.0-1/cp0/e15/q65/p48x48/38977734_905688203096487_2026691898_n.jpg?_nc_cat=1&amp;ccb=2&amp;_nc_sid=dbb9e7&amp;efg=eyJpIjoiYiJ9&amp;_nc_ohc=j6Sj9DTpNBIAX9LLfur&amp;_nc_ht=scontent.fhel6-1.fna&amp;tp=3&amp;oh=c33ee8200b4e553ea92ca2edca2f4165&amp;oe=5FE9E69D" class="dv dw cb r" alt="Harvard University, profile picture"/></a><div class="dx cp"><div class="ee"><div><span class="dy dz de ea"><a class="cq" href="/Harvard/">Harvard University</a></span></div></div><div><span class="eb cs"><span class="ef ck cl">Computer Science and Psychology</span></span></div><div><span class="eb ec">30 August 2002 - 30 April 2004</span></div></div><div class="cr"/></div></div>

educations is this list:

['Harvard University', 'Computer Science and Psychology', '30 August 2002 - 30 April 2004']

I want to write similar code with the lxml library.

My code:

from lxml import etree

file_path = "Mark.html"
with open(file_path) as html_file:
    html = html_file.read()
    # print(html) # prints correct html
    htmlparser = etree.HTMLParser()
    tree = etree.parse(file_path, htmlparser)
    educations = tree.xpath("//div[@id='education']/div/div/div")
    print(etree.tostring(educations[0]))  # Prints raw_educations[0], but I want educations

The code prints raw_educations[0], but I want educations. What changes should I make to my code?

The Mark.html code is here: https://pastebin.com/BJbgXtg0

【Question comments】:

Tags: python selenium lxml

【Solution 1】:

If I understand you correctly, this should be solved by making the end of your lxml code match the end of your Selenium code. That is, change

print(etree.tostring(educations[0]))

to

print(educations[0].text.split("\n"))

Edit: change it to this instead:

educations[0].xpath('.//span//text()')

and see if it works.
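
As a rough, self-contained sketch of the edited suggestion: the snippet below parses a shortened copy of the element from the question (the img tag and its long URL are left out, and the inline `snippet` string is only a stand-in for Mark.html) and collects the text nodes under every span. It assumes the real page nests the element the same way.

```python
from lxml import etree

# Shortened stand-in for Mark.html: same nesting as the element in the
# question, with the <img> and its long URL omitted.
snippet = (
    '<div id="education"><div><div>'
    '<div class="ds dt" id="u_0_2"><div class="cm du">'
    '<a class="co" href="/Harvard/"></a>'
    '<div class="dx cp">'
    '<div class="ee"><div><span class="dy dz de ea">'
    '<a class="cq" href="/Harvard/">Harvard University</a></span></div></div>'
    '<div><span class="eb cs"><span class="ef ck cl">'
    'Computer Science and Psychology</span></span></div>'
    '<div><span class="eb ec">30 August 2002 - 30 April 2004</span></div>'
    '</div><div class="cr"></div></div></div>'
    '</div></div></div>'
)

tree = etree.fromstring(snippet, etree.HTMLParser())
raw_educations = tree.xpath("//div[@id='education']/div/div/div")

# text() selects text nodes; .//span// restricts them to descendants of a span.
educations = raw_educations[0].xpath(".//span//text()")
print(educations)
# ['Harvard University', 'Computer Science and Psychology', '30 August 2002 - 30 April 2004']
```

The leading dot in `.//span//text()` keeps the search inside `raw_educations[0]`; without it, the expression would scan the whole document.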

【Comments】:

Thanks, but that didn't work, because educations[0].text is None.
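
A likely reason for the None: in lxml, an element's `.text` holds only the text that comes before its first child element, whereas Selenium's `.text` returns the rendered text of the whole subtree. The outer div here starts directly with a child div, so there is nothing to return. The sketch below uses a small made-up snippet (not the real Mark.html) to show this, and shows `itertext()` as one way to walk all descendant text instead.

```python
from lxml import etree

# Hypothetical, trimmed-down element with the same shape as the one in the question.
snippet = (
    '<div id="u_0_2">'
    '<div><span><a href="/Harvard/">Harvard University</a></span></div>'
    '<div><span>30 August 2002 - 30 April 2004</span></div>'
    '</div>'
)

root = etree.fromstring(snippet, etree.HTMLParser())
element = root.xpath("//div[@id='u_0_2']")[0]

# .text covers only the text before the first child element, and there is none.
print(element.text)  # None

# itertext() yields every text node in the subtree, in document order.
# Stripping and filtering drops whitespace-only nodes that real pages often contain.
texts = [t.strip() for t in element.itertext() if t.strip()]
print(texts)  # ['Harvard University', '30 August 2002 - 30 April 2004']
```

Unlike the `.//span//text()` XPath, `itertext()` does not filter by tag, so it would also pick up text outside of span elements if the page had any.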