I want to inherit from the BeautifulSoup class to do the following tasks

【Question title】: I want to inherit from the BeautifulSoup class to do the following tasks
【Posted】: 2016-09-23 21:13:54
【Question】:

I am running BeautifulSoup4 on Python 3.5.1.

I currently have this code:

from bs4 import BeautifulSoup
import html5lib

class LinkFinder(BeautifulSoup):

  def __init__(self):
    super().__init__()

  def handle_starttag(self, name, attrs):
    print(name)

I instantiate the class with findmylink = LinkFinder() and then load my HTML with findmylink.feed("""<html><head><title>my name is good</title></head><body>hello world</body></html>""", 'html5lib').

The following error appears in my console:

'NoneType' object is not callable

What I actually want is to replicate the following example code (in my case, using BeautifulSoup instead of html.parser):

from html.parser import HTMLParser
class LinkFinder(HTMLParser):

    def __init__(self):
        super().__init__()

    def handle_starttag(self, tag, attrs):
        print(tag)

When I instantiate this class with findmylink = LinkFinder() and load my HTML with findmylink.feed("""<html><head><title>my name is good</title></head><body>hello world</body></html>"""), I get the following output:

html
head
title
body

This is the desired output.
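
For comparison, here is a minimal sketch of the same html.parser approach that stores the tag names instead of printing them, so the result can be inspected later. The class name TagCollector and the tags attribute are hypothetical names introduced for illustration; they are not part of the original question.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical variant of LinkFinder that records start-tag names."""

    def __init__(self):
        super().__init__()
        self.tags = []  # filled in as feed() encounters start tags

    def handle_starttag(self, tag, attrs):
        # Called by HTMLParser for every opening tag it sees
        self.tags.append(tag)


collector = TagCollector()
collector.feed("""<html><head><title>my name is good</title></head><body>hello world</body></html>""")
print(collector.tags)  # ['html', 'head', 'title', 'body']
```

Because feed() can be called repeatedly on the same HTMLParser instance, this form keeps accumulating tag names across calls.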

【Comments】:

Tags: python html python-3.x web-scraping beautifulsoup

【Solution 1】:

I'm not sure why you need to use BeautifulSoup in such an unusual way. If you simply want to recursively get the names of all elements in the HTML tree:

from bs4 import BeautifulSoup

data = """<html><head><title>my name is good</title></head><body>hello world</body></html>"""

soup = BeautifulSoup(data, "html5lib")
for elm in soup.find_all():
    print(elm.name)

Prints:

html
head
title
body
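
If the same result is needed from something that can be called at any time, the loop can be wrapped in a function. The following is only a sketch: the function name tag_names and its parser parameter are assumptions made for illustration, and the default is switched to the standard library's 'html.parser' so that no extra package is required (pass "html5lib" to match the snippet above).

```python
from bs4 import BeautifulSoup


def tag_names(markup, parser="html.parser"):
    """Return the names of all tags in `markup`, in document order.

    `tag_names` is a hypothetical helper, not part of bs4.
    """
    soup = BeautifulSoup(markup, parser)
    # find_all() with no arguments matches every tag in the tree
    return [elm.name for elm in soup.find_all()]


data = """<html><head><title>my name is good</title></head><body>hello world</body></html>"""
print(tag_names(data))  # ['html', 'head', 'title', 'body']
```

Since this walks the finished tree rather than hooking into the parse, it should not depend on how a particular parser builds that tree.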

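【Discussion】:

Thanks for your input. I'd rather not use this approach, because I need a function that I can call at any time.

【Solution 2】:

If you want to do it this way, change your implementation to accept the markup during initialization, and have handle_starttag take all the arguments that are passed to it: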

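from bs4 import BeautifulSoup

class LinkFinder(BeautifulSoup):

  def __init__(self, markup):
    # Parsing happens inside BeautifulSoup.__init__, so the markup is passed here
    super().__init__(markup, 'html.parser')

  # Signature as used by the bs4 release this answer was written against
  def handle_starttag(self, name, namespace, nsprefix, attrs):
    print(name)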

Initialization:

l = LinkFinder("""<html><head><title>my name is good</title></head><body>hello world</body></html>""")

Prints:

html
head
title
body
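
One caveat, offered as a hedged sketch rather than a definitive fix: handle_starttag is an internal hook, and newer bs4 releases may pass it additional arguments beyond the four shown above. A more tolerant variant can forward whatever it receives to the parent class, which also lets BeautifulSoup keep building its tree. The seen attribute below is a hypothetical name added for illustration.

```python
from bs4 import BeautifulSoup


class LinkFinder(BeautifulSoup):

    def __init__(self, markup):
        # Must exist before super().__init__(), because parsing (and thus
        # handle_starttag) runs inside the BeautifulSoup constructor.
        self.seen = []  # hypothetical attribute, for illustration only
        super().__init__(markup, "html.parser")

    def handle_starttag(self, name, *args, **kwargs):
        # Accept whatever extra arguments this bs4 version passes along
        self.seen.append(name)
        print(name)
        # Delegate so the tag is still added to the parse tree
        return super().handle_starttag(name, *args, **kwargs)


finder = LinkFinder("""<html><head><title>my name is good</title></head><body>hello world</body></html>""")
```

After construction, finder.seen holds the tag names in the order they were opened, and finder still behaves like an ordinary soup object (for example, finder.title is available).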

I'm fairly sure the BeautifulSoup class overloads __getattr__ to return None for undefined attributes instead of raising AttributeError; that is what causes your error:

print(type(BeautifulSoup().feed))
<class 'NoneType'>
print(type(BeautifulSoup().feedededed))
<class 'NoneType'>

Also, BeautifulSoup has no feed method the way HTMLParser does (it does have a _feed, which uses builder to call the builder object's underlying feed), so what you end up calling is a None object.
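
The effect described above can be reproduced in a few lines. This is a minimal sketch, assuming the usual bs4 behaviour in which an unknown attribute name on a soup object is treated as a lookup for a child tag of that name and yields None when no such tag exists.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello</p>", "html.parser")

# A name that matches a child tag returns that tag
print(soup.p)  # <p>hello</p>

# "feed" is not a method here; it is looked up as a <feed> tag,
# and since the document has none, the result is None
print(soup.feed)  # None

try:
    soup.feed("<b>x</b>")  # effectively None("<b>x</b>")
except TypeError as exc:
    message = str(exc)
    print(message)
```

The message printed in the except branch should match the error from the question.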

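【Discussion】:

You're awesome. However, I found that I can only use 'html.parser' with this implementation; other parsers (such as 'html5lib') don't print anything to the screen. I'll stick with your answer, but do you know why 'html5lib' doesn't work?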