Extract Text from Varied HTML Using Python: Answers

【Question Title】: Extract Text from Varied HTML using Python
【Posted】: 2018-12-08 05:08:56
【Question Description】:

Suppose you have a varied block of HTML like the following:

<div class="container">
  <div class="sub-container">
    <a href="example.com">Blue</a>
  </div>
  Black
  </br>
  <div class="sub-container">
    <a href="example.com">Yellow</a>
  </div>
  <div class="sub-container">
    <a href="example.com">Pink</a>
  </div>
  Orange
  </br>
</div>

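What is a way to extract the colors from this HTML block using Python?

【Question Discussion】:

Why was this downvoted without a comment?
Perhaps the downvoter (not me) thinks you should explain what the problem is and what you have tried yourself.

Tags: python regex beautifulsoup lxml

【Solution 1】:

You can use .text to get all of the colors from your example HTML.

For example: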

from bs4 import BeautifulSoup
s = """<div class="container">
  <div class="sub-container">
    <a href="example.com">Blue</a>
  </div>
  Black
  </br>
  <div class="sub-container">
    <a href="example.com">Yellow</a>
  </div>
  <div class="sub-container">
    <a href="example.com">Pink</a>
  </div>
  Orange
  </br>
</div>"""
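# Parse the snippet with Python's built-in HTML parser backend.
soup = BeautifulSoup(s, "html.parser")
# .text concatenates every text node; strip() trims both ends and
# replace() removes every space character, including the indentation.
print(soup.text.strip().replace(" ", ""))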

Output:

Blue

Black


Yellow


Pink

Orange
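
If a list of colors is more convenient than one string full of blank lines, a minimal sketch using BeautifulSoup's stripped_strings generator might look like the following. The variable names html and colors are illustrative and not part of the original answer; the HTML is the sample from the question.

```python
from bs4 import BeautifulSoup

# Sample HTML copied from the question.
html = """<div class="container">
  <div class="sub-container">
    <a href="example.com">Blue</a>
  </div>
  Black
  </br>
  <div class="sub-container">
    <a href="example.com">Yellow</a>
  </div>
  <div class="sub-container">
    <a href="example.com">Pink</a>
  </div>
  Orange
  </br>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each text node with leading and trailing
# whitespace removed, and skips nodes that are whitespace only.
colors = list(soup.stripped_strings)
print(colors)  # ['Blue', 'Black', 'Yellow', 'Pink', 'Orange']
```

Unlike replace(" ", ""), this only trims whitespace at the ends of each text node, so a value such as "Light Blue" would keep its inner space.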

【Discussion】:

【Solution 2】:

To extract the contents of tags in HTML using a regular expression, you might want to try this:

<(\w+)[\s\w\d=\-+\.]*>(.*)</\1\s*>

Then use group 2 to get everything inside that tag. You can also change the start of the regex to:

<(a) (etc...)

This will match only <a> tags.
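
To try the <(a) variant on the question's HTML, here is a hedged sketch using Python's re module. It makes two changes that are assumptions and not part of the answer above: the attribute character class is replaced with [^>]*, because the original class contains no quote character and therefore cannot match href="example.com", and (.*) is made non-greedy. The names html, original, pattern, and link_texts are illustrative.

```python
import re

# Sample HTML copied from the question.
html = """<div class="container">
  <div class="sub-container">
    <a href="example.com">Blue</a>
  </div>
  Black
  </br>
  <div class="sub-container">
    <a href="example.com">Yellow</a>
  </div>
  <div class="sub-container">
    <a href="example.com">Pink</a>
  </div>
  Orange
  </br>
</div>"""

# The pattern exactly as given in the answer. Its attribute class lacks
# the '"' character, so it finds nothing in this quoted-attribute sample.
original = re.compile(r"<(\w+)[\s\w\d=\-+\.]*>(.*)</\1\s*>")
print(original.search(html))  # None

# Adapted <a>-only variant (assumption): [^>]* accepts any attribute text,
# and .*? stops at the first matching closing tag.
pattern = re.compile(r"<(a)\b[^>]*>(.*?)</\1\s*>", re.DOTALL)

# Group 2 holds the text between the opening and closing tag.
link_texts = [m.group(2).strip() for m in pattern.finditer(html)]
print(link_texts)  # ['Blue', 'Yellow', 'Pink']
```

Note that this only captures text inside <a> tags, so the bare Black and Orange text nodes are not found; an HTML parser is the more dependable route when the markup varies.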

【Discussion】: