How to prevent lxml from removing the doctype

【Question title】: How to prevent lxml removing doctype
【Posted】: 2020-10-12 19:08:34
【Question description】:

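First, some context. I want a custom HTML class in which I can prettify HTML (that part is not shown in the code below).

I really like the lxml library, and if I knew how to properly prettify HTML with custom indentation using it, I wouldn't even consider BeautifulSoup. Unfortunately I don't, so I came up with this slow and obscure little snippet: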

import lxml.html
from bs4 import BeautifulSoup


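def write_new_line(line, current_indent, indent):
    # prettify() indents one space per level, so the line already carries
    # current_indent spaces; pad it up to current_indent * indent in total.
    spaces_to_add = (current_indent * indent) - current_indent
    # A non-positive count yields an empty string, so no explicit guard is needed.
    return " " * spaces_to_add + str(line) + "\n"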


def prettify_html(content, indent=4):
    soup = BeautifulSoup(content, "html.parser")
    pretty_soup = str()
    previous_indent = 0
    for line in soup.prettify().split("\n"):
        current_indent = str(line).find("<")
        if current_indent == -1 or current_indent > previous_indent + 2:
            current_indent = previous_indent + 1
        previous_indent = current_indent
        pretty_soup += write_new_line(line, current_indent, indent)
    return pretty_soup.strip()


class Html:
    def __init__(self, string_or_html):
        if isinstance(string_or_html, str):
            self.html = lxml.html.fromstring(string_or_html)
        else:
            self.html = string_or_html

    def __str__(self):
        return prettify_html(lxml.html.tostring(self.html).decode("utf-8"), indent=4)


if __name__ == "__main__":
    import textwrap

    html = textwrap.dedent(
        """
        <!DOCTYPE html>
        <html lang="en">
            <head>
            </head>
            <body>
            </body>
        </html>
    """
    ).strip()

    print("broken_code".center(80, "-"))
    print(Html(html))

    print("good_code".center(80, "-"))
    print(prettify_html(html))

As the output below shows, the current class easily ends up emitting broken code:

----------------------------------broken_code-----------------------------------
<html lang="en">
    <head>
    </head>
    <body>
    </body>
</html>
-----------------------------------good_code------------------------------------
<!DOCTYPE html>
<html lang="en">
    <head>
    </head>
    <body>
    </body>
</html>

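You could argue that this isn't broken code, but in my experience a missing doctype can easily cause problems when the HTML is rendered.

So the question is:

a) How can I prettify my HTML with lxml, with custom indentation, without losing the original information?

or

b) How can I prevent lxml from dropping the original information, so that BeautifulSoup prettifies consistently?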

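【Comments】:

If you end up calling BeautifulSoup anyway, why use lxml separately? Why not just use soup = BeautifulSoup(content, "lxml")?
That's really interesting, I didn't even know that was possible. As I mentioned in the question, I use lxml because I want it for other tasks, so the main goal here is to drop the beautifulsoup dependency while still being able to prettify HTML (with proper indentation)... I'll read up on it to see exactly what it does, though from the name it seems BS would somehow use lxml as the backend parser.

Tags: python beautifulsoup lxml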
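
For reference, the suggestion in the first comment can be sketched as follows. This is only a minimal illustration, assuming both bs4 and lxml are installed; the `content` string is a made-up sample, not the asker's real input. With the `"lxml"` parser name, BeautifulSoup hands parsing to lxml but builds its own tree, so the doctype is kept in the prettified output.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
content = '<!DOCTYPE html>\n<html lang="en"><head></head><body></body></html>'

# "lxml" selects lxml as the backend parser; the result is still a bs4 tree.
soup = BeautifulSoup(content, "lxml")
pretty = soup.prettify()
print(pretty)
```

Note that this still depends on BeautifulSoup and still uses its default one-space indentation, so it does not by itself address the custom-indentation part of the question.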

【Solution 1】:

I didn't fully understand what you mean, but here is an implementation that may be what you want:

import lxml.html


class Html:
    def __init__(self, string_or_html):
        if isinstance(string_or_html, str):
            self.html = lxml.html.fromstring(string_or_html)
        else:
            self.html = string_or_html

    def __str__(self):
        doctype = self.html.getroottree().docinfo.doctype
        return lxml.html.tostring(self.html, pretty_print=True, encoding="unicode", doctype=doctype)
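
A quick way to check that this approach keeps the doctype is sketched below. The helper name `to_pretty_html` and the sample string are hypothetical, introduced only for illustration. The `or None` guard is my own precaution for inputs that carry no doctype (for example fragments), not part of the answer above.

```python
import lxml.html


def to_pretty_html(source: str) -> str:
    """Serialize HTML with lxml, re-emitting the doctype found while parsing."""
    root = lxml.html.fromstring(source)
    # docinfo.doctype is an empty string when the input had no doctype;
    # pass None in that case so tostring() emits no doctype line at all.
    doctype = root.getroottree().docinfo.doctype or None
    return lxml.html.tostring(
        root, pretty_print=True, encoding="unicode", doctype=doctype
    )


# Hypothetical sample document, used only for illustration.
sample = '<!DOCTYPE html>\n<html lang="en"><head></head><body></body></html>'
result = to_pretty_html(sample)
print(result)
```

The doctype lives on the document (`getroottree().docinfo`) rather than on the root element, which is why serializing only the element, as in the question, loses it.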

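【Comments】:

This is really interesting, I'll test it later, but from the name it seems it will keep the original HTML content? Awesome... Do you know how to prettify the string without using BeautifulSoup? Anyway, as I said, I'll test it later... in the meantime, +1 (assuming it preserves the original doctype).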
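
On prettifying without BeautifulSoup: one possible route, sketched here under the assumption that lxml 4.5 or newer is available, is `lxml.etree.indent`, which rewrites the whitespace between elements using a custom indent string. The function name `prettify_with_lxml` and the sample input are hypothetical.

```python
import lxml.html
from lxml import etree


def prettify_with_lxml(source: str, indent: int = 4) -> str:
    """Re-indent HTML with lxml only, keeping the original doctype if present."""
    root = lxml.html.fromstring(source)
    # Empty string means "no doctype"; pass None so nothing is emitted for it.
    doctype = root.getroottree().docinfo.doctype or None
    # etree.indent (lxml >= 4.5) replaces whitespace-only text and tails in place.
    etree.indent(root, space=" " * indent)
    return lxml.html.tostring(root, encoding="unicode", doctype=doctype)


# Hypothetical sample document, used only for illustration.
sample = (
    '<!DOCTYPE html>\n<html lang="en"><head></head>'
    "<body><p>Hello</p></body></html>"
)
result = prettify_with_lxml(sample)
print(result)
```

One caveat: `etree.indent` is not HTML-aware, so it may add whitespace around inline elements or inside `pre` blocks where whitespace is significant. Whether that is acceptable depends on the documents being processed.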