[Question title]: Get raw text from lxml element?
[Posted]: 2020-12-27 15:09:15
[Question]:

I have some Python code that works well.

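from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("some_url")
# find_elements_by_xpath was removed in later Selenium 4 releases; find_elements(By.XPATH, ...) works in old and new versions
raw_educations = driver.find_elements(By.XPATH, "//div[@id='education']/div/div/div")
educations = raw_educations[0].text.split("\n")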

raw_educations[0] is this element:

<div class="ds dt" id="u_0_2"><div class="cm du"><a class="co" href="/Harvard/"><img src="https://scontent.fhel6-1.fna.fbcdn.net/v/t34.0-1/cp0/e15/q65/p48x48/38977734_905688203096487_2026691898_n.jpg?_nc_cat=1&amp;ccb=2&amp;_nc_sid=dbb9e7&amp;efg=eyJpIjoiYiJ9&amp;_nc_ohc=j6Sj9DTpNBIAX9LLfur&amp;_nc_ht=scontent.fhel6-1.fna&amp;tp=3&amp;oh=c33ee8200b4e553ea92ca2edca2f4165&amp;oe=5FE9E69D" class="dv dw cb r" alt="Harvard University, profile picture"/></a><div class="dx cp"><div class="ee"><div><span class="dy dz de ea"><a class="cq" href="/Harvard/">Harvard University</a></span></div></div><div><span class="eb cs"><span class="ef ck cl">Computer Science and Psychology</span></span></div><div><span class="eb ec">30 August 2002 - 30 April 2004</span></div></div><div class="cr"/></div></div>

and educations is this list:

['Harvard University', 'Computer Science and Psychology', '30 August 2002 - 30 April 2004']

I want to write similar code using the lxml library.

My code:

from lxml import etree

file_path = "Mark.html"
with open(file_path) as html_file:
    html = html_file.read()
    # print(html) # prints correct html
    htmlparser = etree.HTMLParser()
    tree = etree.parse(file_path, htmlparser)
    educations = tree.xpath("//div[@id='education']/div/div/div")
    print(etree.tostring(educations[0])) # Prints raw_educations[0], but i want educations

The code prints raw_educations[0], but I want educations. What changes should I make to my code?

The source of Mark.html is here: https://pastebin.com/BJbgXtg0

[Question comments]:

    Tags: python selenium lxml


    [Solution 1]:

    If I understand you correctly, this should be solved by making the last line of your lxml code match the last line of your Selenium code. That is, change

    print(etree.tostring(educations[0]))

    to

    print(educations[0].text.split("\n"))
    

    Edit: since that does not work, change it to:

    educations[0].xpath('.//span//text()')
    

    and see if it works.
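
As a rough check of the suggestion above, here is a minimal sketch that runs the same XPath against a trimmed copy of the element from the question. The assumptions are mine, not the original poster's: the image URL is shortened to the placeholder `x.jpg`, the element is wrapped in three `div`s so the question's XPath matches it, and `visible_text_lines` is a hypothetical helper name. Selenium's `.text` reflects rendered text, so on pages with hidden elements the two approaches may not agree.

```python
from lxml import etree

# Trimmed copy of the element from the question (image URL replaced by a placeholder),
# wrapped so that //div[@id='education']/div/div/div selects it.
SNIPPET = (
    '<div id="education"><div><div>'
    '<div class="ds dt" id="u_0_2"><div class="cm du">'
    '<a class="co" href="/Harvard/"><img src="x.jpg" class="dv dw cb r" '
    'alt="Harvard University, profile picture"/></a>'
    '<div class="dx cp"><div class="ee"><div><span class="dy dz de ea">'
    '<a class="cq" href="/Harvard/">Harvard University</a></span></div></div>'
    '<div><span class="eb cs"><span class="ef ck cl">'
    'Computer Science and Psychology</span></span></div>'
    '<div><span class="eb ec">30 August 2002 - 30 April 2004</span></div></div>'
    '<div class="cr"/></div></div>'
    '</div></div></div>'
)


def visible_text_lines(element):
    """Collect every non-empty text fragment below `element`, in document order."""
    return [t.strip() for t in element.itertext() if t.strip()]


tree = etree.fromstring(SNIPPET, etree.HTMLParser())
educations = tree.xpath("//div[@id='education']/div/div/div")

# Option 1: the XPath from the answer, restricted to text inside <span> elements.
span_texts = educations[0].xpath(".//span//text()")

# Option 2: all descendant text, regardless of which tag holds it.
all_texts = visible_text_lines(educations[0])

print(span_texts)
print(all_texts)
```

The XPath variant depends on the text living inside `<span>` elements, as it does in this markup. The `itertext()` variant does not care about tag names, which may be more robust if the page layout changes.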

    [Comments]:

    • Thanks, but that didn't work, because educations[0].text is None
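
The `None` reported in the comment above is consistent with how lxml stores text: an element's `.text` holds only the text that appears before its first child, and text that follows a child is stored in that child's `.tail`. A small sketch with made-up markup (the strings `lead`, `inner`, and `tail` are illustrative only):

```python
from lxml import etree

# Text before the first child goes to .text; text after a child goes to that child's .tail.
root = etree.fromstring("<div>lead<span>inner</span>tail<b/></div>")
span = root[0]

print(root.text)   # text directly before <span>
print(span.text)   # text inside <span>
print(span.tail)   # text directly after </span>

# An element that starts with a child element has no leading text at all.
wrapper = etree.fromstring("<div><span>inner</span></div>")
print(wrapper.text)

# itertext() walks .text and .tail of the whole subtree in document order.
joined = "".join(root.itertext())
print(joined)
```

Because the `div` selected in the question begins with another `div`, its own `.text` is empty, and the text has to be gathered from its descendants instead.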