HTML Parse Tree Using Python 2.7: Answers

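【Question title】: HTML Parse tree using Python 2.7
【Posted】: 2023-03-20 15:45:01
【Question description】:

I am trying to build a parse tree for the HTML below, but I cannot work out how to do it. I just want to see what the tree structure looks like. Can anyone help?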

# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link2">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

Edit

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\matt>easy_install ete2
Searching for ete2
Reading http://pypi.python.org/simple/ete2/
Reading http://ete.cgenomics.org
Reading http://ete.cgenomics.org/releases/ete2/
Reading http://ete.cgenomics.org/releases/ete2
Best match: ete2 2.1rev539
Downloading http://ete.cgenomics.org/releases/ete2/ete2-2.1rev539.tar.gz
Processing ete2-2.1rev539.tar.gz
Running ete2-2.1rev539\setup.py -q bdist_egg --dist-dir c:\users\arupra~1\appdat
a\local\temp\easy_install-sypg3x\ete2-2.1rev539\egg-dist-tmp-zemohm

Installing ETE (A python Environment for Tree Exploration).

Checking dependencies...
numpy cannot be found in your python installation.
Numpy is required for the ArrayTable and ClusterTree classes.
MySQLdb cannot be found in your python installation.
MySQLdb is required for the PhylomeDB access API.
PyQt4 cannot be found in your python installation.
PyQt4 is required for tree visualization and image rendering.
lxml cannot be found in your python installation.
lxml is required from Nexml and Phyloxml support.

However, you can still install ETE without such functionality.
Do you want to continue with the installation anyway? [y,n]y
Your installation ID is: d33ba3b425728e95c47cdd98acda202f
warning: no files found matching '*' under directory '.'
warning: no files found matching '*.*' under directory '.'
warning: manifest_maker: MANIFEST.in, line 4: path 'doc/ete_guide/' cannot end w
ith '/'

warning: manifest_maker: MANIFEST.in, line 5: path 'doc/' cannot end with '/'

warning: no previously-included files matching '*.pyc' found under directory '.'

zip_safe flag not set; analyzing archive contents...
Adding ete2 2.1rev539 to easy-install.pth file
Installing ete2 script to C:\Python27\Scripts

Installed c:\python27\lib\site-packages\ete2-2.1rev539-py2.7.egg
Processing dependencies for ete2
Finished processing dependencies for ete2

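【Question discussion】:

@Oded, in Python, I guess :)
@Oded I just want to see what its tree structure looks like. I am working with a Python package that treats an HTML document as a parse tree, so I would like to see that tree. I would appreciate any help with this.
I can't, as I am not a Python person (and now you know why you should tag questions with the language). It is also unclear to me how you expect to view the parse tree; you need to expand on that as well.
@Oded I just want to see what it looks like as a tree-like structure, that's all. It does not have to be a tree built in Python, although Python generates it in a standard way too. It should be a top-down parse tree.
Why not edit the question and add these details to it?

Tags: python python-2.7 beautifulsoup parse-tree etetoolkit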

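【Solution 1】:

This answer is a bit late, but I would still like to share it:

I used networkx and lxml (which I found allows much more elegant traversal of the DOM tree). However, the tree layout depends on graphviz and pygraphviz being installed; networkx on its own would just distribute the nodes somehow on the canvas. The code is actually longer than required, because I draw the labels myself so that they can be boxed (networkx provides a function for drawing labels, but it does not pass the bbox keyword on to matplotlib).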

import networkx as nx
from lxml import html
import matplotlib.pyplot as plt
from networkx.drawing.nx_agraph import graphviz_layout

raw = "...your raw html"

def traverse(parent, graph, labels):
    labels[parent] = parent.tag
    for node in parent:  # iterating an lxml element yields its children; getchildren() is deprecated
        graph.add_edge(parent, node)
        traverse(node, graph, labels)

G = nx.DiGraph()
labels = {}     # needed to map from node to tag
html_tag = html.document_fromstring(raw)
traverse(html_tag, G, labels)

pos = graphviz_layout(G, prog='dot')

label_props = {'size': 16,
               'color': 'black',
               'weight': 'bold',
               'horizontalalignment': 'center',
               'verticalalignment': 'center',
               'clip_on': True}
bbox_props = {'boxstyle': "round, pad=0.2",
              'fc': "grey",
              'ec': "b",
              'lw': 1.5}

nx.draw_networkx_edges(G, pos, arrows=True)
ax = plt.gca()

# Draw each label by hand so that it gets a rounded box.
for node, label in labels.items():
    x, y = pos[node]
    ax.text(x, y, label,
            bbox=bbox_props,
            **label_props)

ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
plt.show()

If you prefer (or have) to use BeautifulSoup, change the code as follows:

I am no expert... this is my first look at BS4... but it works:

#from lxml import html
from bs4 import BeautifulSoup
from bs4.element import NavigableString

...

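def traverse(parent, graph, labels):
    # Key nodes by id() (object identity): equal-looking tags can share a
    # hash() value, which would merge them into a single graph node.
    labels[id(parent)] = parent.name
    for node in parent.children:
        if isinstance(node, NavigableString):
            continue    # skip text nodes, keep elements only
        graph.add_edge(id(parent), id(node))
        traverse(node, graph, labels)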

...

#html_tag = html.document_fromstring(raw)
soup = BeautifulSoup(raw, 'html.parser')   # name the parser explicitly; html.parser ships with Python
html_tag = next(soup.children)

...
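
If installing graphviz and pygraphviz is not an option, the same kind of traversal can simply print the tree as text. The following is only a minimal sketch under that assumption, using nothing but the standard library; the names Node, TreeBuilder, render and html_to_text_tree are hypothetical, and text nodes and attributes are deliberately ignored:

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7

# Elements that never have a closing tag.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'param', 'source', 'track', 'wbr'}


class Node(object):
    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Collects element nodes only; text and attributes are ignored."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.root = Node('[document]')
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def render(node, prefix=''):
    """Return the subtree below `node` as a list of text lines."""
    lines = []
    for i, child in enumerate(node.children):
        last = (i == len(node.children) - 1)
        lines.append(prefix + ('`-- ' if last else '|-- ') + child.tag)
        lines.extend(render(child, prefix + ('    ' if last else '|   ')))
    return lines


def html_to_text_tree(raw):
    builder = TreeBuilder()
    builder.feed(raw)
    builder.close()
    return '\n'.join(render(builder.root))


# A shortened version of the document from the question.
sample = ('<html><head><title>The Dormouse\'s story</title></head>'
          '<body><p class="title"><b>The Dormouse\'s story</b></p>'
          '<p class="story"><a id="link1">Elsie</a></p></body></html>')
print(html_to_text_tree(sample))
```

For the shortened sample this prints html at the top, with head (containing title) and body (containing two p elements) indented below it. Badly nested markup is handled only roughly, so treat the output as a quick look at the structure rather than what a browser would build.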

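【Discussion】:

Which Python packages do I need to install, and can I use easy_install?
You need networkx, matplotlib, graphviz, pygraphviz, and lxml. I installed all of them easily from the Ubuntu 12.10 package manager.
You can download graphviz as a Windows binary; I just checked. lxml, however, you would have to build from source and supply the required dependencies (libxml2, libxslt). Building and linking from source on Windows is inherently difficult, so honestly I suggest you skip lxml. It is only needed here to parse and traverse the HTML; you can use BeautifulSoup instead. The rest should be available through pip install or easy_install. You will also need numpy.
Yes, I already have BS4 installed. Does that require any code changes?
Well, def traverse(...) needs to change to use the BS4 API. I may look into it, but I have not used it so far...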

【Solution 2】:

Python modules:
1. ETE, but it requires data in Newick format.
2. GraphViz + pydot. See this SO answer.

Javascript:
The amazing d3 TreeLayout, which uses JSON format.
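
For the d3 route, the element tree first has to be serialized as nested JSON. Below is a minimal sketch, assuming the {"name": ..., "children": [...]} shape commonly seen in d3 hierarchy examples; JsonTreeBuilder and html_to_d3_json are hypothetical names, only the standard library is used, and text nodes and attributes are dropped:

```python
import json

try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7

# Elements that never have a closing tag.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'param', 'source', 'track', 'wbr'}


class JsonTreeBuilder(HTMLParser):
    """Builds nested dicts of the form {'name': tag, 'children': [...]}."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.root = {'name': '[document]', 'children': []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {'name': tag, 'children': []}
        self.stack[-1]['children'].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]['name'] == tag:
                del self.stack[i:]
                break


def html_to_d3_json(raw):
    builder = JsonTreeBuilder()
    builder.feed(raw)
    builder.close()
    top = builder.root['children']
    # Use the single top-level element as the root when there is exactly one.
    tree = top[0] if len(top) == 1 else builder.root
    return json.dumps(tree, indent=2)


print(html_to_d3_json('<html><head><title>t</title></head>'
                      '<body><p>a<br>b</p></body></html>'))
```

The resulting string can be saved to a file and loaded on the Javascript side; how the layout consumes it depends on the d3 version in use.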

If you are using ETE, you need to convert the HTML to Newick format. Here is a small example I put together:

from lxml import html
from urllib import urlopen


def getStringFromNode(node):
    # Customize this according to
    # your requirements.
    node_string = node.tag
    if node.get('id'):
        node_string += '-' + node.get('id')
    if node.get('class'):
        node_string += '-' + node.get('class')
    return node_string


def xmlToNewick(node):
    node_string = getStringFromNode(node)
    nwk_children = []
    for child in node.iterchildren():
        nwk_children.append(xmlToNewick(child))
    if nwk_children:
        return "(%s)%s" % (','.join(nwk_children), node_string)
    else:
        return node_string


def main():
    html_page = html.fromstring(urlopen('http://www.google.co.in').read())
    newick_page = xmlToNewick(html_page)
    return newick_page

main()

Output (http://www.google.co.in in Newick format):

'((meta,title,script,style,style,script)head,(script,textarea-csi,(((b-gb1,a-gb1,a-gb1,a-gb1,a-gb1,a-gb1,a-gb1,a-gb1,(u)a-gb1)nobr)div-gbar,((span-gbn-gbi,span-gbf-gbf,span-gbe,a-gb4,a-gb4,a-gb_70-gb4)nobr)div-guser,div-gbh,div-gbh)div-mngb,(br-lgpd,(((div)div-hplogo)div,br)div-lga,(((td,(input,input,input,(input-lst)div-ds,br,((input-lsb)span-lsbb)span-ds,((input-lsb)span-lsbb)span-ds)td,(a,a)td-fl sblc)tr)table,input-gbv)form,div-gac_scont,(br,((a,a,a,a,a,a,a,a,a)font-addlang,br,br)div-als)div,(((a,a,a,a,a-fehl)div-fll)div,(a)p)span-footer)center,div-xjsd,(script)div-xjsi,script)body)html'

After that, you can use ETE as shown in its examples.
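
Two details may need attention before the string is handed to ETE. As far as I know, a Newick string is expected to end with a semicolon, which the output above lacks, and characters such as spaces, commas, colons and parentheses inside id or class values would be read as Newick syntax. The sketch below is an assumption-laden illustration, not part of the original answer: sanitize_label and finalize_newick are hypothetical helpers (sanitize_label would be applied to the value returned by getStringFromNode), and the ETE import is guarded so the snippet still runs without ETE installed. Tree(..., format=1) is the ETE format that reads internal node names.

```python
import re


def sanitize_label(label):
    """Replace characters that have structural meaning in Newick."""
    return re.sub(r"[\s()\[\],:;']+", '_', label)


def finalize_newick(newick):
    """Terminate the tree string with a semicolon if it has none."""
    newick = newick.strip()
    return newick if newick.endswith(';') else newick + ';'


# A shortened, hand-written stand-in for the output of xmlToNewick().
newick = finalize_newick('((meta,title)head,(p,p)body)html')

try:
    from ete3 import Tree          # newer ETE releases
except ImportError:
    try:
        from ete2 import Tree      # the release installed in the question
    except ImportError:
        Tree = None                # ETE not available

if Tree is not None:
    tree = Tree(newick, format=1)  # format=1 keeps internal node names
    print(tree.get_ascii(show_internal=True))
else:
    print(newick)
```

With ETE installed this prints an ASCII drawing of the tree, which needs none of the optional dependencies (PyQt4, numpy) listed in the installer log; without ETE it just prints the terminated Newick string.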

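Hope this helps.

【Discussion】:

Can this Python code be used to generate a graphical view of HTML code?
@AlexL There is no Windows version, and I am on Windows 7.
You need numpy and so on; the easiest way is to download them with pip install numpy.
@AlexL Yes, I just did that. Please see my updated description! It asked for some packages to be installed, but I have not installed them. Will that cause any problems?
@PythonLikeYOU Updated the answer with code that converts HTML to Newick format.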