【Posted】: 2020-12-27 15:09:15
【Question】:
I have some working Python code that uses Selenium.
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("some_url")
raw_educations = driver.find_elements_by_xpath("//div[@id='education']/div/div/div")
educations = raw_educations[0].text.split("\n")
raw_educations[0] is this element:
<div class="ds dt" id="u_0_2"><div class="cm du"><a class="co" href="/Harvard/"><img src="https://scontent.fhel6-1.fna.fbcdn.net/v/t34.0-1/cp0/e15/q65/p48x48/38977734_905688203096487_2026691898_n.jpg?_nc_cat=1&ccb=2&_nc_sid=dbb9e7&efg=eyJpIjoiYiJ9&_nc_ohc=j6Sj9DTpNBIAX9LLfur&_nc_ht=scontent.fhel6-1.fna&tp=3&oh=c33ee8200b4e553ea92ca2edca2f4165&oe=5FE9E69D" class="dv dw cb r" alt="Harvard University, profile picture"/></a><div class="dx cp"><div class="ee"><div><span class="dy dz de ea"><a class="cq" href="/Harvard/">Harvard University</a></span></div></div><div><span class="eb cs"><span class="ef ck cl">Computer Science and Psychology</span></span></div><div><span class="eb ec">30 August 2002 - 30 April 2004</span></div></div><div class="cr"/></div></div>
educations is this list:
['Harvard University', 'Computer Science and Psychology', '30 August 2002 - 30 April 2004']
I want to write equivalent code using the lxml library.
My code:
from lxml import etree
file_path = "Mark.html"
with open(file_path) as html_file:
    html = html_file.read()
# print(html) # prints correct html
htmlparser = etree.HTMLParser()
tree = etree.parse(file_path, htmlparser)
educations = tree.xpath("//div[@id='education']/div/div/div")
print(etree.tostring(educations[0]))  # prints raw_educations[0], but I want educations
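The code prints the markup of raw_educations[0], but I want the educations list of text lines. What changes should I make to my code?
The Mark.html source is here: https://pastebin.com/BJbgXtg0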
【Discussion】:
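One possible approach is to collect the element's text nodes with itertext() instead of serializing the element with etree.tostring(). The following is only a minimal sketch under stated assumptions: SAMPLE_HTML is a trimmed-down stand-in for Mark.html (the real page may nest differently), and element_text_lines is a hypothetical helper name, not something from the question or from lxml. Selenium's .text reflects rendered, visible text, so the two results may differ on pages with hidden elements.

```python
from lxml import etree

# Assumption: a shortened stand-in for Mark.html with the same nesting
# as the element shown in the question.
SAMPLE_HTML = """
<html><body>
<div id="education"><div><div>
<div class="ds dt" id="u_0_2"><div class="cm du">
<a class="co" href="/Harvard/"><img src="pic.jpg" alt="Harvard University, profile picture"/></a>
<div class="dx cp">
<div class="ee"><div><span class="dy dz de ea"><a class="cq" href="/Harvard/">Harvard University</a></span></div></div>
<div><span class="eb cs"><span class="ef ck cl">Computer Science and Psychology</span></span></div>
<div><span class="eb ec">30 August 2002 - 30 April 2004</span></div>
</div><div class="cr"/></div></div>
</div></div></div>
</body></html>
"""


def element_text_lines(element):
    """Return the non-empty text fragments under element, in document order.

    Hypothetical helper: itertext() walks every text node below the element,
    and whitespace-only fragments (indentation, newlines) are dropped.
    """
    return [text.strip() for text in element.itertext() if text.strip()]


# Parse the markup with the lenient HTML parser, as in the question.
root = etree.fromstring(SAMPLE_HTML, etree.HTMLParser())
raw_educations = root.xpath("//div[@id='education']/div/div/div")

# Text lines instead of serialized markup.
educations = element_text_lines(raw_educations[0])
print(educations)
```

With the real file, the same helper could be applied to the result of etree.parse(file_path, htmlparser).xpath(...). Attribute values such as the img alt text are not text nodes, so they are not included.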