Parsing HTML file contents with BeautifulSoup: answers

【Question title】: beautifulsoup parse html file contents
【Posted】: 2017-10-28 15:11:23
【Question】:

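I have 30911 html files in a folder. For each file I need to (1) check whether it contains the tag:

<strong>123</strong>

and (2) extract the content that follows it, up to the end of that section.

The problem is that in some files the section ends before

<strong>567</strong>

while other files do not have that tag, and the section ends before

<strong>89</strong> or some other tag (which ones, I do not know, because I can't check 30K+ files)

The class `p p_number` also differs from file to file, and sometimes the element has no id.

So I search with beautifulsoup first, but I don't know how to extract the content after that: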

soup = bs4.BeautifulSoup(fo, "lxml")
m = soup.find("strong", string=re.compile("123"))

By the way, is it possible to save the content as txt while keeping the line layout it has in the html?

line 1
line 2
...
line 50

With p.get_text(strip=True), everything runs together:

line1 content line2 content ... 
line50 content....

【Comments】:

Tags: python html parsing web-scraping beautifulsoup

【Solution 1】:

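If I understand you correctly, you can first locate the starting point: a p element that contains a strong element with the text "Question-and-Answer Session". Then iterate over the next siblings of that p element until you reach one containing a strong element with the text "Copyright policy".

Complete reproducible example: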

import re

from bs4 import BeautifulSoup


data = """
<body>
    <p class="p p4" id="question-answer-session">
      <strong>
       Question-and-Answer Session
      </strong>
    </p>

    <p class="p p4">
       Hi John and Greg, good afternoon. contents....
    </p>

    <p class="p p14">
      <strong>
       Copyright policy:
      </strong>
      other content about the policy....
    </p>
</body>
"""

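soup = BeautifulSoup(data, "html.parser")

def find_question_answer(tag):
    # Match a <p> that contains a <strong> with the section title.
    # `string=` replaces the older `text=` argument (bs4 4.4.0+).
    return tag.name == 'p' and tag.find("strong", string=re.compile(r"Question-and-Answer Session"))

# Starting point: the paragraph holding the section heading.
question_answer = soup.find(find_question_answer)

# Walk the following <p> siblings and stop at the end marker.
for p in question_answer.find_next_siblings("p"):
    if p.find("strong", string=re.compile(r"Copyright policy")):
        break

    print(p.get_text(strip=True))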

Prints:

Hi John and Greg, good afternoon. contents....
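
On the side question about saving plain text without losing the line layout, one possible approach is to extract each paragraph separately and join the results with newlines. The snippet below is only a sketch: the sample markup is made up for illustration, and the real files may need a different separator or a different notion of what counts as a "line".

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real files will look different.
html = """
<p>line 1 <em>content</em></p>
<p>line 2 content</p>
"""

soup = BeautifulSoup(html, "html.parser")

# A " " separator keeps words from inline tags apart inside one paragraph.
paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]

# One paragraph per line in the resulting text.
text = "\n".join(paragraphs)
print(text)
```

The resulting string can then be written to a .txt file with an ordinary `open(..., "w", encoding="utf-8")`.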

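【Discussion】:

If I write the content to a new html file, the formatting comes out messy.
@MichaelLin Okay, which part do you want to write to the file?
I think I solved it: I used p.prettify().encode('ascii', 'ignore').decode('utf-8', 'ignore'), and now it saves only the content before the copyright section.
But as I mentioned in the question, there is another tag, "Related:", so the end marker may be either "Copyright" or "Related". Is there any way to handle that?
@MichaelLin One option is to adjust the regular expression: re.compile(r"(Copyright policy|related)").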
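
Putting the thread together, a rough sketch for the whole folder might look like the following. The names `extract_section`, `has_strong`, and `process_folder`, the `*.html` glob, and the exact end-marker list are assumptions made for illustration, not part of the original answer. Files without the start marker are reported rather than raising an error, and a file without any end marker is read to its last sibling paragraph.

```python
import re
from pathlib import Path

from bs4 import BeautifulSoup

START = re.compile(r"Question-and-Answer Session")
# Assumed end markers; extend the alternation as new ones turn up.
END = re.compile(r"Copyright policy|Related")


def has_strong(tag, pattern):
    """Return True if tag contains a <strong> whose text matches pattern."""
    return any(pattern.search(s.get_text()) for s in tag.find_all("strong"))


def extract_section(html):
    """Return paragraph texts between the start marker and the first end marker.

    Returns None when the start marker is missing from the document.
    """
    soup = BeautifulSoup(html, "html.parser")
    start = soup.find(lambda t: t.name == "p" and has_strong(t, START))
    if start is None:
        return None
    texts = []
    for p in start.find_next_siblings("p"):
        if has_strong(p, END):
            break
        texts.append(p.get_text(" ", strip=True))
    return texts


def process_folder(src_dir, out_dir):
    """Write one .txt per matching .html file; return names of skipped files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    skipped = []
    for path in sorted(Path(src_dir).glob("*.html")):
        texts = extract_section(path.read_text(encoding="utf-8", errors="ignore"))
        if texts is None:
            skipped.append(path.name)
            continue
        (out / (path.stem + ".txt")).write_text("\n".join(texts), encoding="utf-8")
    return skipped
```

Matching on `get_text()` instead of `string=` is a deliberate choice here: it still works when the strong element has nested tags. The list returned by `process_folder` gives a quick way to see which files lack the start marker without opening all 30K of them.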