How do I extract both content and their parent HTML elements from a web page? (Answers)

【Question title】: How do I extract both content and their parent HTML elements from a web page?
【Posted】: 2019-05-11 21:13:03
【Question】:

Suppose you have a web page:

<html>
<head>
<meta name="description" content="Hello World Test">
</head>
<body>
<h1>Hello World!!!</h1>
<p>How are you today?</p>
<p>What have you been up to?</p>
</body>
</html>

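Is there a way to loop over the nodes on the page and, whenever a node contains text, extract that text?

I would then like to organize the text by its XPath.

So for the page above, the result would be: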

/html/body/h1: Hello World!!!

/html/body/p[1]: How are you today?

/html/body/p[2]: What have you been up to?

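Many thanks

【Comments】:

Tags: python xpath

【Solution 1】:

You can use the lxml library, for example: iterate over every node in the tree and, for each node that contains text, print its XPath together with the text: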

from lxml import html

tree = html.fromstring("""
<html>
 <head>
  <meta content="Hello World Test" name="description"/>
 </head>
 <body>
  <h1>Hello World!!!</h1>
  <p>How are you today?</p>
  <p>What have you been up to?</p>
 </body>
</html>
""")

for node in tree.iter():
    if node.text and node.text.strip():
        print(node.getroottree().getpath(node), node.text)

Output:

/html/body/h1 Hello World!!!

/html/body/p[1] How are you today?

/html/body/p[2] What have you been up to?
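
Since the question asks to organize the text by XPath, the same loop could also collect its results into a dictionary instead of printing them. This is only a minimal sketch: `text_by_xpath` and `page` are hypothetical names, not part of lxml, and it only looks at each element's own leading text (`node.text`), so text that follows a child element (lxml's `tail`) is not captured.

```python
from lxml import html


def text_by_xpath(source):
    """Map each element's XPath to its own stripped text (hypothetical helper)."""
    tree = html.fromstring(source)
    root = tree.getroottree()
    result = {}
    for node in tree.iter():
        # Comments and processing instructions have a non-string .tag; skip them.
        if not isinstance(node.tag, str):
            continue
        text = (node.text or "").strip()
        if text:
            result[root.getpath(node)] = text
    return result


page = """
<html>
 <head><meta content="Hello World Test" name="description"/></head>
 <body>
  <h1>Hello World!!!</h1>
  <p>How are you today?</p>
  <p>What have you been up to?</p>
 </body>
</html>
"""

for path, text in text_by_xpath(page).items():
    print(f"{path}: {text}")
```

Keeping the paths as dictionary keys makes it easy to look up or compare the text of a specific element later on.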

【Comments】:

【Solution 2】:

If you are using Selenium, here is a solution.

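from selenium.webdriver.common.by import By

# Assumes `driver` is a Selenium WebDriver with the page already loaded.
# find_elements_by_xpath() was removed in Selenium 4; use find_elements(By.XPATH, ...).
nodes = driver.find_elements(By.XPATH, "//body/*")
for node in nodes:
    node_path = ''
    node_text = node.text
    # Walk up through the ancestors, prepending each tag name to the path.
    while node.tag_name != 'html':
        node_path = node.tag_name + "/" + node_path
        node = node.find_element(By.XPATH, "./..")
    # Note: sibling positions such as p[1] are not included in this path.
    print('/html/' + node_path[0:-1] + ": " + node_text)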
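
The loop above builds its paths from tag names only, so both paragraphs would come out as `/html/body/p`. One possible way to get the `p[1]` / `p[2]` form from the question is to count same-tag siblings with the XPath `preceding-sibling` and `following-sibling` axes. The sketch below is an assumption-laden illustration, not a Selenium API: `indexed_xpath` is a hypothetical helper, and it passes the locator strategy as the plain string `"xpath"`, which is the value of `By.XPATH`.

```python
XPATH = "xpath"  # same string as selenium.webdriver.common.by.By.XPATH


def indexed_xpath(element):
    """Build an absolute XPath with sibling positions for a Selenium WebElement.

    Hypothetical helper: relies only on tag_name, find_element and find_elements.
    """
    steps = []
    node = element
    while True:
        tag = node.tag_name
        before = len(node.find_elements(XPATH, f"./preceding-sibling::{tag}"))
        after = len(node.find_elements(XPATH, f"./following-sibling::{tag}"))
        # Add a [n] position only when other siblings share the same tag.
        steps.append(f"{tag}[{before + 1}]" if before or after else tag)
        if tag == "html":
            break
        node = node.find_element(XPATH, "./..")  # step up to the parent element
    return "/" + "/".join(reversed(steps))
```

With a live `driver`, this could be called as `indexed_xpath(node)` for each element returned by `driver.find_elements(XPATH, "//body//*")`. Every call makes several WebDriver round-trips, so on large pages it will be noticeably slower than parsing the page source with lxml.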

【Comments】: