【Question title】: python lxml.html.soupparser.fromstring raising annoying warning
【Posted】: 2016-12-21 07:19:32
【Question】:

My code...

foo = fromstring(my_html)

It raises this warning...

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

  markup_type=markup_type))

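I tried passing the string 'html.parser' to it, but that didn't work: it raised an error saying a string is not callable. So I tried html.parser instead, and then looked through the lxml module for another parser, but couldn't find one. I also looked at the Python standard library and saw that 2.7 has one called HTMLParser, so I imported it and passed beautifulsoup=HTMLParser, but that didn't work either.

Where is the callable that I am supposed to pass to fromstring?

EDIT: added the solutions I tried: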

from lxml.html.soupparser import fromstring
wiktionary_page = fromstring(wiktionary_page.read(), features="html.parser" )

And this:

from lxml.html.soupparser import BeautifulSoup
wiktionary_page = fromstring(wiktionary_page.read(), beautifulsoup=lambda s: BeautifulSoup(s, "html.parser"))

【Comments】:

    Tags: python beautifulsoup lxml


    【Solution 1】:

    You can pass the features keyword to set the parser.

    import lxml.html.soupparser
    tree = lxml.html.soupparser.fromstring("<p>foo</p>", features="html.parser")
    

    What happens inside fromstring is that _parse gets called. I think the line bsargs['features'] = ['html.parser'] there should be bsargs['features'] = 'html.parser':

    def _parse(source, beautifulsoup, makeelement, **bsargs):
        if beautifulsoup is None:
            beautifulsoup = BeautifulSoup
        if hasattr(beautifulsoup, "HTML_ENTITIES"):  # bs3
            if 'convertEntities' not in bsargs:
                bsargs['convertEntities'] = 'html'
        if hasattr(beautifulsoup, "DEFAULT_BUILDER_FEATURES"):  # bs4
            if 'features' not in bsargs:
                bsargs['features'] = ['html.parser']  # use Python html parser
        tree = beautifulsoup(source, **bsargs)
        root = _convert_tree(tree, makeelement)
        # from ET: wrap the document in a html root element, if necessary
        if len(root) == 1 and root[0].tag == "html":
            return root[0]
        root.tag = "html"
        return root
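
To make the dispatch in `_parse` easier to follow, here is a minimal, stdlib-only sketch of the same argument handling. `FakeSoup` and `parse_sketch` are hypothetical names invented for illustration; `FakeSoup` only records what it was called with and is not the real bs4 class, and the value of its `DEFAULT_BUILDER_FEATURES` attribute is a placeholder.

```python
class FakeSoup:
    """Hypothetical stand-in for bs4.BeautifulSoup; it only records its arguments."""

    # Placeholder value: only the attribute's presence matters for the check below.
    DEFAULT_BUILDER_FEATURES = ["placeholder"]

    def __init__(self, markup, features=None, **kwargs):
        self.markup = markup
        self.features = features


def parse_sketch(source, beautifulsoup=None, **bsargs):
    """Simplified mirror of the argument handling in _parse (no tree conversion)."""
    if beautifulsoup is None:
        beautifulsoup = FakeSoup
    if hasattr(beautifulsoup, "DEFAULT_BUILDER_FEATURES"):
        # A default is filled in only when the caller did not pass features=...
        if "features" not in bsargs:
            bsargs["features"] = ["html.parser"]
    return beautifulsoup(source, **bsargs)


# 1. No arguments: the list default from _parse is used.
default_call = parse_sketch("<p>foo</p>")

# 2. features= travels through **bsargs to the parser class.
explicit_call = parse_sketch("<p>foo</p>", features="html.parser")

# 3. A custom callable has no DEFAULT_BUILDER_FEATURES attribute, so no
#    default is injected and the lambda alone decides which parser is used.
custom_call = parse_sketch(
    "<p>foo</p>", beautifulsoup=lambda s: FakeSoup(s, "html.parser")
)

print(default_call.features)
print(explicit_call.features)
print(custom_call.features)
```

In this sketch only the first call ends up with a list as `features`; the other two pass the plain string, which is what both suggestions in this answer do.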
    

    You can also use a lambda:

    from lxml.html.soupparser import BeautifulSoup
    import lxml.html.soupparser
    
    tree = lxml.html.soupparser.fromstring("<p>foo</p>", beautifulsoup=lambda s: BeautifulSoup(s, "html.parser"))
    

    Both suppress the warning:

    In [13]: from lxml.html import soupparser
    
    In [14]: tree = soupparser.fromstring("<p>foo</p>", features="html.parser" )
    In [15]: from lxml.html.soupparser import BeautifulSoup
    
    In [16]: import lxml.html.soupparser
    
    
    In [17]: tree = lxml.html.soupparser.fromstring("<p>foo</p>", beautifulsoup=lambda s: BeautifulSoup(s, "html.parser"))
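
If you would rather check for the warning programmatically than by watching the terminal, the standard library's `warnings.catch_warnings(record=True)` can capture whatever a call emits. The sketch below is stdlib-only: `collect_warnings` is a hypothetical helper, and `fake_parse` is a stand-in that merely imitates the behaviour described in the question (warn when no parser is named). It is not the real lxml or bs4 code.

```python
import warnings


def collect_warnings(func, *args, **kwargs):
    """Call func and return (result, list of warning messages it emitted)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        result = func(*args, **kwargs)
    return result, [str(w.message) for w in caught]


def fake_parse(markup, features=None):
    """Hypothetical stand-in: warns only when no parser is named explicitly."""
    if features is None:
        warnings.warn("No parser was explicitly specified", UserWarning)
        features = "html.parser"
    return (markup, features)


_, noisy = collect_warnings(fake_parse, "<p>foo</p>")
_, quiet = collect_warnings(fake_parse, "<p>foo</p>", features="html.parser")

print(len(noisy))
print(len(quiet))
```

To run the same check against the real parser, pass `soupparser.fromstring` and its arguments to `collect_warnings` in place of `fake_parse`.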
    

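    【Comments】:

    • Both work for me. Are you using exactly what you posted?
    • I've added what I tried; as far as I can tell it is functionally identical to yours.
    • @deltaskelta, what version of lxml are you using? I can't see how you could still be getting the warning, especially with the second example, since there is no other call to bs4 bar the one inside that lambda.
    • lxml 3.6 on python 2.7. I'm running these commands in a django test environment, which is why the output is so annoying (it clogs up my terminal so I can't see the test output clearly).
    • I don't know why the original answer didn't work for me, but I revisited the problem and this answer worked. Thanks.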