How to remove texts within html tags in python? [duplicate] - Answers

【Question title】: How to remove texts within html tags in python? [duplicate]
【Posted】: 2012-09-29 00:35:09
【Question description】:

Possible duplicate:
Strip html from strings in python

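While building a small browser-like application, I ran into the problem of splitting apart the text that sits between different tags. Consider the string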

<html> <h1> good morning </h1> welcome </html>

I need the following output: ['good morning', 'welcome']

How can I do this in Python?

【Comments】:

Tags: python html

【Solution 1】:

I would use xml.etree.ElementTree:

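import xml.etree.ElementTree as ET

def get_text(element):
    # Only the direct children of `element` are visited.
    for child in element:
        # Text directly inside the child tag
        if child.text:
            yield child.text
        # Text that follows the child's closing tag
        if child.tail:
            yield child.tail

# ET.fromstring requires well-formed XML
root = ET.fromstring('<html> <h1> good morning </h1> welcome </html>')
# Python 3 print; surrounding whitespace is kept
print(list(get_text(root)))  # [' good morning ', ' welcome ']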
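
If the markup nests more deeply, or the surrounding whitespace should be dropped to match the output asked for in the question, `Element.itertext()` from the same module walks the whole tree in document order. The following is only a minimal sketch, assuming the input is well-formed XML; the helper name `stripped_texts` is made up for illustration and is not part of the original answer.

```python
import xml.etree.ElementTree as ET

def stripped_texts(markup):
    """Return the non-empty text fragments of well-formed markup, stripped.

    Hypothetical helper for illustration; raises ET.ParseError on malformed input.
    """
    root = ET.fromstring(markup)
    # itertext() yields every text and tail string in document order
    return [t.strip() for t in root.itertext() if t.strip()]

print(stripped_texts('<html> <h1> good morning </h1> welcome </html>'))
# ['good morning', 'welcome']
```

Real-world HTML is often not well-formed XML, in which case this approach fails with a parse error and an HTML-aware parser is the safer choice.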

【Discussion】:

【Solution 2】:

You can use one of Python's HTML/XML parsers.

Beautiful Soup is popular. lxml is also popular.

Both of those are third-party packages; you can also use the standard library:

http://docs.python.org/library/xml.etree.elementtree.html
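
Besides ElementTree, the standard library also ships `html.parser.HTMLParser`, which tolerates markup that is not well-formed XML. As a rough sketch of the standard-library route (the names `TextCollector` and `extract_text` are invented here for illustration, and Python 3 is assumed):

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Hypothetical helper: collects the text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs
        text = data.strip()
        if text:
            self.chunks.append(text)

def extract_text(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.chunks

print(extract_text('<html> <h1> good morning </h1> welcome </html>'))
# ['good morning', 'welcome']
```

Note that the contents of `<script>` and `<style>` elements are passed to `handle_data` as well, so a browser-like application would probably want to filter those out.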

【Discussion】:

【Solution 3】:

I would use the Python library Beautiful Soup to achieve your goal. With its help, it takes only a few lines of code:

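from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup('<html> <h1> good morning </h1> welcome </html>', 'html.parser')
# stripped_strings yields each text fragment with surrounding whitespace removed
print(list(soup.stripped_strings))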

【Discussion】: