How to filter HTML tags with Python: answers

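【Question title】: How to filter html tags with Python
【Posted】: 2017-04-13 08:32:13
【Question description】:

I have an HTML document containing an article. I use a few tags for text formatting, but my text editor adds many unnecessary tags of its own. I want to write a Python program that filters these tags out. What is the main logic (structure, strategy) of such a program? I am a Python beginner and want to learn the language by solving a real, practical task, but I need a general overview to get started.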

【Comments】:

Tags: python html text filter

【Solution 1】:

Use BeautifulSoup:

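from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

# Placeholder input; replace it with the HTML of your article
html_string = '<html><body><div class="content">Some <b>text</b></div></body></html>'
parsed_html = BeautifulSoup(html_string, 'html.parser')
# Placeholder attrs; use the attributes that appear in your own HTML
print(parsed_html.body.find('div', attrs={'class': 'content'}).text)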

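Here, div is just an example tag; you can use any tag whose text you want to extract.

【Comments】:

【Solution 2】:

Your requirements are not very clear, but you should use a ready-made parser in Python, such as BeautifulSoup.

You can find a tutorial here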

【Comments】:

【Solution 3】:

I am not sure what this might miss, but you can use a regular expression.

import re
re.sub('<[^<]+?>', '', text)  # text is the HTML string

The function above searches for...
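As a rough illustration of the regular-expression approach, the sketch below wraps the same pattern in a small helper. The name strip_tags_regex and the two sample strings are made up for this example. The second sample shows one case where the pattern is likely to go wrong: a ">" inside an attribute value.

```python
import re

# Same pattern as above: "<", then as few non-"<" characters as possible, then ">"
TAG_PATTERN = re.compile(r'<[^<]+?>')

def strip_tags_regex(text):
    """Remove everything that looks like a tag (hypothetical helper name)."""
    return TAG_PATTERN.sub('', text)

simple = '<p>Hello <b>world</b></p>'
tricky = '<a title="a > b">link</a>'  # ">" inside an attribute value

print(strip_tags_regex(simple))  # tags removed, text kept
print(strip_tags_regex(tricky))  # part of the attribute is left behind
```

The second call leaves a fragment of the attribute in the output, which is why a real parser tends to be the safer choice for anything beyond very simple markup.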

Alternatively, you can use HTMLParser:

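from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser

class MLStripper(HTMLParser):
    """Collects the text content and drops all tags."""
    def __init__(self):
        # convert_charrefs=False so that handle_entityref is still called
        super().__init__(convert_charrefs=False)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def handle_entityref(self, name):
        # Keep named entities such as &amp; unchanged in the output
        self.fed.append('&%s;' % name)
    def get_data(self):
        return ''.join(self.fed)

def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any input still buffered by the parser
    return s.get_data()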
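
The question asks about keeping a few formatting tags while dropping the rest, rather than stripping everything. One possible strategy, sketched here with the same standard-library parser, is an allow-list: re-emit the tags you want and silently skip all others. The class name TagFilter, the ALLOWED_TAGS set and the sample input are assumptions made for illustration, not part of the original answer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list: tags to keep; every other tag is dropped
ALLOWED_TAGS = {'p', 'b', 'i', 'em', 'strong'}

class TagFilter(HTMLParser):
    """Re-emits text and allowed tags (without attributes); drops other tags."""

    def __init__(self):
        # With the default convert_charrefs=True, entities arrive decoded in handle_data
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.parts.append('<%s>' % tag)  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        # Re-escape text so characters like "<" and "&" stay valid HTML
        self.parts.append(escape(data, quote=False))

    def get_html(self):
        return ''.join(self.parts)

def filter_tags(html):
    parser = TagFilter()
    parser.feed(html)
    parser.close()  # flush any buffered input
    return parser.get_html()

sample = ('<p style="margin:0"><span class="x">Hello</span> <b>world</b> '
          '&amp; <font size="2">more</font></p>')
print(filter_tags(sample))
```

Note that this sketch keeps the text inside dropped elements, including script or style content, so those would need separate handling if they occur in the input.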

【Comments】: