【Question title】: How to do a breadth-first search easily with Beautiful Soup?
【Posted】: 2017-06-28 09:31:37
【Question description】:

I am trying to do a breadth-first search on a Beautiful Soup tree. I know we can do a depth-first search with Beautiful Soup like this:

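from bs4 import BeautifulSoup

html = """SOME HTML FILE"""

soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

# recursiveChildGenerator() is the older name for soup.descendants (depth-first)
for child in soup.recursiveChildGenerator():
    # do some stuff here
    pass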

But I don't know how to do a breadth-first search. Does anyone have any ideas or suggestions?

Thanks for your help.

【Comments】:

    Tags: python beautifulsoup tree-search


    【Solution 1】:

    Use the .children generator to append each element's children to a breadth-first queue:

    from collections import deque

    from bs4 import BeautifulSoup
    import requests
    
    html = requests.get("https://stackoverflow.com/questions/44798715/").text
    soup = BeautifulSoup(html, "html5lib")  # requires the html5lib package
    queue = deque([([], soup)])  # FIFO queue of (path, element) pairs
    while queue:
        path, element = queue.popleft()  # O(1), unlike list.pop(0)
        if hasattr(element, 'children'):  # skip leaf elements
            for child in element.children:
                queue.append((path + [child.name if child.name is not None else type(child)],
                              child))
        # do stuff
        print(path, repr(element.string[:50]) if element.string else type(element))
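
The same queue idea can be tried without a network request. The following is a minimal sketch, not the answerer's code: the helper name `bfs_with_depth` and the sample HTML are made up for illustration, and it tracks a depth counter instead of a full path.

```python
from collections import deque

from bs4 import BeautifulSoup
from bs4.element import Tag


def bfs_with_depth(root):
    """Yield (depth, node) pairs below root in breadth-first order."""
    # hypothetical helper: seed the queue with the root's direct children
    queue = deque((1, child) for child in root.children)
    while queue:
        depth, node = queue.popleft()
        yield depth, node
        if isinstance(node, Tag):  # only tags can have children
            queue.extend((depth + 1, child) for child in node.children)


# illustrative input, parsed with the standard-library backed parser
sample_html = "<div><p>one<b>bold</b></p><span>two</span></div>"
soup = BeautifulSoup(sample_html, "html.parser")

# keep only tags so the level-by-level order is easy to read
tag_order = [(depth, node.name) for depth, node in bfs_with_depth(soup)
             if isinstance(node, Tag)]
print(tag_order)
```

Here `span` is reported before `b`, because `span` sits one level higher in the tree even though `b` appears earlier in the markup.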
    

    【Comments】:

      【Solution 2】:

      Traverse an HTML document parsed by BeautifulSoup using DFS or BFS:

      solution.py

      import bs4
      from bs4 import BeautifulSoup
      
      html = """
      <div>root
           <div>child1
                <div>child4
                </div>
                <div>child5
                </div>
           </div>
           <div>child2
           </div>
           <div>child3
                <div>child6
                </div>
           </div>
      </div>
      """
      

      Add these lines to solution.py:

      def visit(node):
          if isinstance(node, bs4.element.Tag):
              # careful: BeautifulSoup subclasses Tag, so the document root also lands here
              print(type(node), 'tag:', node.name)
          elif isinstance(node, bs4.element.NavigableString):
              # careful: bs4.element.CData and bs4.element.Comment subclass NavigableString
              print(type(node), repr(node.string))
          else:
              print(type(node), 'UNKNOWN')
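
Because of those subclass relationships, the order of the isinstance checks decides which branch a node ends up in. A small sketch of checking the most specific class first; the function name `kind` and the sample markup are assumptions made for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import CData, Comment, NavigableString, Tag


def kind(node):
    """Classify a node, testing subclasses before their parent classes."""
    if isinstance(node, BeautifulSoup):  # subclass of Tag
        return "document"
    if isinstance(node, Tag):
        return "tag"
    if isinstance(node, Comment):  # subclass of NavigableString
        return "comment"
    if isinstance(node, CData):  # subclass of NavigableString
        return "cdata"
    if isinstance(node, NavigableString):
        return "text"
    return "unknown"


soup = BeautifulSoup("<p>text<!-- note --></p>", "html.parser")
text_node, comment_node = soup.p.contents

kinds = [kind(soup), kind(soup.p), kind(text_node), kind(comment_node)]
print(kinds)
```

If the `Tag` or `NavigableString` test came first, the document root and the comment would be reported as a plain tag and plain text.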
      

      And:

      def dfs(html):
          bs = BeautifulSoup(html, 'html.parser')
          # <class 'bs4.BeautifulSoup'> [document]
          visit(bs)
          for child in bs.recursiveChildGenerator():
              visit(child)
      
      
      def bfs(html):
          bs = BeautifulSoup(html, 'html.parser')
          # <class 'bs4.BeautifulSoup'> [document]
          visit(bs)
          for child in recursiveChildGeneratorBfs(bs):
              visit(child)
      
      
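      from collections import deque  # gives O(1) pops from the left


      def recursiveChildGeneratorBfs(bs):
          """Yield every node below bs in breadth-first order."""
          queue = deque([bs])  # FIFO queue, so nodes come out level by level
          while queue:
              node = queue.popleft()
              if node is not bs:
                  yield node
              if hasattr(node, 'children'):
                  queue.extend(node.children)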
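
One practical reason to prefer breadth-first order is finding the match closest to the root. The sketch below is an illustration under assumptions: `find_shallowest` is a hypothetical helper, not part of Beautiful Soup, and the markup is invented. It is compared with `find()`, which returns the first match in document order.

```python
from collections import deque

from bs4 import BeautifulSoup
from bs4.element import Tag


def find_shallowest(root, name):
    """Return the tag called `name` nearest to root, or None if absent."""
    queue = deque(root.children)
    while queue:
        node = queue.popleft()
        if isinstance(node, Tag):
            if node.name == name:
                return node  # first hit in BFS order is the shallowest
            queue.extend(node.children)
    return None


doc = BeautifulSoup(
    "<div><section><a href='/deep'>deep</a></section>"
    "<a href='/top'>top</a></div>",
    "html.parser",
)

shallow = find_shallowest(doc, "a")  # breadth-first: nearest to the root
first_in_document = doc.find("a")    # document order: depth-first
print(shallow["href"], first_in_document["href"])
```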
      

      In the IPython console:

      In [1]: run solution.py
      

      BFS:

      In [2]: bfs(html)
      <class 'bs4.BeautifulSoup'> tag: [document]
      <class 'bs4.element.NavigableString'> '\n'
      <class 'bs4.element.Tag'> tag: div
      <class 'bs4.element.NavigableString'> '\n'
      <class 'bs4.element.NavigableString'> 'root\n     '
      <class 'bs4.element.Tag'> tag: div
      <class 'bs4.element.NavigableString'> '\n'
      <class 'bs4.element.Tag'> tag: div
      <class 'bs4.element.NavigableString'> '\n'
      <class 'bs4.element.Tag'> tag: div
      <class 'bs4.element.NavigableString'> '\n'
      <class 'bs4.element.NavigableString'> 'child1\n          '
      <class 'bs4.element.Tag'> tag: div
      <class 'bs4.element.NavigableString'> '\n'
      <class 'bs4.element.Tag'> tag: div
      <class 'bs4.element.NavigableString'> '\n'
      <class 'bs4.element.NavigableString'> 'child2\n     '
      <class 'bs4.element.NavigableString'> 'child3\n          '
      <class 'bs4.element.Tag'> tag: div
      <class 'bs4.element.NavigableString'> '\n'
      <class 'bs4.element.NavigableString'> 'child4\n          '
      <class 'bs4.element.NavigableString'> 'child5\n          '
      <class 'bs4.element.NavigableString'> 'child6\n          '
      

      DFS:

      In [3]: dfs(html)
      <class 'bs4.BeautifulSoup'> tag: [document]
      <class 'bs4.element.NavigableString'> '\n'
      <class 'bs4.element.Tag'> tag: div
      <class 'bs4.element.NavigableString'> 'root\n     '
      <class 'bs4.element.Tag'> tag: div
      <class 'bs4.element.NavigableString'> 'child1\n          '
      <class 'bs4.element.Tag'> tag: div
      <class 'bs4.element.NavigableString'> 'child4\n          '
      <class 'bs4.element.NavigableString'> '\n'
      <class 'bs4.element.Tag'> tag: div
      <class 'bs4.element.NavigableString'> 'child5\n          '
      <class 'bs4.element.NavigableString'> '\n'
      <class 'bs4.element.NavigableString'> '\n'
      <class 'bs4.element.Tag'> tag: div
      <class 'bs4.element.NavigableString'> 'child2\n     '
      <class 'bs4.element.NavigableString'> '\n'
      <class 'bs4.element.Tag'> tag: div
      <class 'bs4.element.NavigableString'> 'child3\n          '
      <class 'bs4.element.Tag'> tag: div
      <class 'bs4.element.NavigableString'> 'child6\n          '
      <class 'bs4.element.NavigableString'> '\n'
      <class 'bs4.element.NavigableString'> '\n'
      <class 'bs4.element.NavigableString'> '\n'
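
Whitespace-only strings make up much of both listings. To compare the two orders more easily, one option is to keep only the non-blank text nodes. This is a sketch under assumptions: the helper names `bfs_nodes` and `text_labels` are invented, and the HTML below only mirrors the nesting of the sample document.

```python
from collections import deque

from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = """
<div>root
     <div>child1
          <div>child4</div>
          <div>child5</div>
     </div>
     <div>child2</div>
     <div>child3
          <div>child6</div>
     </div>
</div>
"""


def bfs_nodes(root):
    """Yield every node below root in breadth-first order."""
    queue = deque(root.children)
    while queue:
        node = queue.popleft()
        yield node
        if isinstance(node, Tag):
            queue.extend(node.children)


def text_labels(nodes):
    """Return the stripped text of every non-blank string node."""
    return [node.strip() for node in nodes
            if isinstance(node, NavigableString) and node.strip()]


soup = BeautifulSoup(html, "html.parser")
bfs_labels = text_labels(bfs_nodes(soup))
dfs_labels = text_labels(soup.descendants)  # .descendants is depth-first
print("BFS:", bfs_labels)
print("DFS:", dfs_labels)
```

The breadth-first list visits child1, child2 and child3 before any of their children, while the depth-first list finishes child1's subtree (child4, child5) before moving on to child2.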
      

      See:

      Documentation

      【Comments】:
