Remove HTML tag contents from page using Python

【Question title】: Remove HTML tag contents from page using Python
【Posted】: 2021-06-21 13:56:29
【Question description】:

I have an HTML file that looks like this:

<!DOCTYPE HTML>
<html>

<head>

<title>Sezione microbiologia</title>
<link rel="stylesheet" href="./style.css">

</head>

<body>

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Prima diluizione</h1>
        <p>Some content including "prima diluizione"...</p>
        <h1>Seconda diluizione</h1>
        <p>Some content including "seconda diluizione"...</p>
        <h1>Terza diluizione</h1>
        <p>Some content including "terza diluizione"...</p>
    </section>

    <section id="second">
        <!-- SOME CONTENT... -->
    </section>

    <section id="third">
        <!-- SOME CONTENT... -->
    </section>

    <section id="footer">
        <!-- SOME CONTENT... -->
    </section>
</div>
</body>

</html>

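The problem:

I am trying to modify the <h1> headings that contain the word diluizione so that the word and its prefix are replaced with "Diluizione seriale". I tried doing this with Python's replace(), but the problem is that the lines inside the <p> paragraphs are changed as well, whereas I only want to modify the lines inside h1 tags. On top of that, I have not found a way to strip the prefix ("Prima", "Seconda", "Terza", and so on) automatically.

What I have tried

So far I have come up with this: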

with open('./home.html') as file:
    text = file.read()


if "diluizione" in text:
    text = text.replace("diluizione", "diluizione seriale")

But this outputs:

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Prima diluizione seriale</h1>
        <p>Some content including "prima diluizione seriale"...</p>
        <h1>Seconda diluizione seriale</h1>
        <p>Some content including "seconda diluizione seriale"...</p>
        <h1>Terza diluizione seriale</h1>
        <p>Some content including "terza diluizione seriale"...</p>
    </section>

As you can see, even the text in the <p> tags is affected, and the prefix is still there.

My desired output is:

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Diluizione seriale</h1>
        <p>Some content including "prima diluizione"...</p>
        <h1>Diluizione seriale</h1>
        <p>Some content including "seconda diluizione"...</p>
        <h1>Diluizione seriale</h1>
        <p>Some content including "terza diluizione"...</p>
    </section>

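Any help or advice is greatly appreciated. Thank you very much.

【Comments】:

Tags: python html replace

【Solution 1】:

You can do this with regular expressions, using Python's re module. To match only the text inside h1 tags, you can use a positive lookbehind and a positive lookahead.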

Code:

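import re

# Read the whole HTML file into a single string
with open("path/to/home.html") as file:
    text = file.read()

# Raw string so \w reaches the regex engine unescaped; the lookarounds
# restrict the match to text directly between <h1> and </h1>
text = re.sub(r"(?<=<h1>)\w+ \w+(?=</h1>)", "Diluizione seriale", text)

print(text)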

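Explanation:

The regular expression (?<=<h1>)\w+ \w+(?=</h1>) matches two words (runs of word characters) separated by a single space, but only when they are immediately preceded by <h1> and immediately followed by </h1>. The lookbehind and lookahead are zero-width, so the tags themselves are not part of the match and are left untouched by the substitution.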
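To see what the lookarounds do, here is a quick check on a made-up snippet (the `snippet` string is an assumption, not the asker's file). `re.findall` returns only the heading text, without the surrounding tags, and a three-word heading is not matched at all:

```python
import re

# Made-up one-line sample, not the real home.html
snippet = '<h1>Prima diluizione</h1><p>Some "prima diluizione" text</p>'
pattern = r"(?<=<h1>)\w+ \w+(?=</h1>)"

# Only the text directly between <h1> and </h1> is matched
matches = re.findall(pattern, snippet)
print(matches)

# A heading with three words does not fit "\w+ \w+", so nothing matches
no_matches = re.findall(pattern, "<h1>Prima diluizione seriale</h1>")
print(no_matches)
```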

Output (relevant part):

<!-- SOME CONTENT... -->
<h1>Diluizione seriale</h1>
<p>Some content including "prima diluizione"...</p>
<h1>Diluizione seriale</h1>
<p>Some content including "seconda diluizione"...</p>
<h1>Diluizione seriale</h1>
<p>Some content including "terza diluizione"...</p>
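Note that the pattern above rewrites any two-word h1, whether or not it mentions "diluizione". A possible refinement, sketched here under the assumption that the headings carry no attributes or nested tags, is to require the word itself and allow any prefix in front of it. The `sample` string and the "Risultati finali" heading are hypothetical stand-ins for the real file:

```python
import re

# Hypothetical sample standing in for the contents of home.html
sample = """\
<h1>Prima diluizione</h1>
<p>Some content including "prima diluizione"...</p>
<h1>Risultati finali</h1>
<h1>Seconda diluizione</h1>
"""

# Match heading text that ends in "diluizione", with any prefix before it.
# [^<]* cannot cross a "<", so the match stays inside the heading's own text.
pattern = re.compile(r"(?<=<h1>)[^<]*\bdiluizione\s*(?=</h1>)", re.IGNORECASE)

result = pattern.sub("Diluizione seriale", sample)
print(result)
```

With this variant, headings that do not end in "diluizione" are left alone, and the <p> text is still untouched because it is never directly preceded by <h1>.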

【Comments】:

【Solution 2】:

Take a look at html.parser. Rather than trying to do string manipulation, parse the HTML into a structure and traverse it from there.
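As a rough sketch of what that could look like with the standard library's html.parser: the class below re-emits the document as it is parsed and swaps the text of any h1 that contains the target word. The class name `H1Rewriter`, its parameters, and the `sample` string are made up for illustration. The sketch assumes each heading's text arrives as a single chunk (no entities or nested tags inside the h1) and does not re-emit processing instructions:

```python
from html.parser import HTMLParser


class H1Rewriter(HTMLParser):
    """Re-emit the document unchanged, except for matching <h1> text."""

    def __init__(self, needle, replacement):
        # convert_charrefs=False so entities such as &amp; pass through verbatim
        super().__init__(convert_charrefs=False)
        self.needle = needle.lower()
        self.replacement = replacement
        self.out = []
        self.in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
        # get_starttag_text() returns the tag exactly as written in the source
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Replace the whole heading text, prefix included, if it has the word
        if self.in_h1 and self.needle in data.lower():
            data = self.replacement
        self.out.append(data)

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


# Hypothetical sample standing in for the contents of home.html
sample = """\
<section id="main">
    <!-- SOME CONTENT... -->
    <h1>Prima diluizione</h1>
    <p>Some content including "prima diluizione"...</p>
</section>
"""

parser = H1Rewriter("diluizione", "Diluizione seriale")
parser.feed(sample)
parser.close()
result = "".join(parser.out)
print(result)
```

Compared with the regex, this keeps working when the h1 has attributes or a different number of words, at the cost of more code.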

【Comments】:

Thanks for your answer! I went with the regex approach, but I will definitely take a look at the HTML parser. Thanks for your time.