Extract html tags from a text file through iteration and append them to a list and ignore all other characters in python: answers

[Question title]: Extract html tags from a text file through iteration and append them to a list and ignore all other characters in python
[Posted]: 2019-02-13 19:40:06
[Question description]:

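I want to be able to read an HTML file and extract only the tags from it.

Read the file one character at a time, ignoring everything until a "<" is reached.

Then read one character at a time, appending each to a string, until ">" or whitespace is reached (the ">" itself can be dropped).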

  <html>
   <body>
   <h1>This is test</h1>
   <h2> This is test 2<h2>
   </body>
   <html>


    with open('doc.txt', 'r') as f:
        all_lines = []
        # loop through all lines using the f.readlines() method
        for line in f.readlines():
            new_line = []
            # loop through each character of the line
            for chars in line:
                new_line.append(chars)
            all_lines.append(new_line)

        print(all_lines)

I can iterate over the text file and get a list like this:

[['', '\n'], ['', '\n'], ['', '\n'] , ['']]

But the expected output should be: [html,body,h1,/h1,/h2,/body,/html]
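
The character-by-character procedure described in the question could be sketched roughly as follows. This is only an illustration, not the asker's code: the function name `extract_tags` and the `sample` string are hypothetical, and `sample` is a well-formed variant of the question's input (with `</h2>` and `</html>` closed), so its output differs slightly from what the original file would give.

```python
def extract_tags(text):
    """Collect tag names by scanning the text one character at a time."""
    tags = []
    current = None  # None means we are currently outside a tag name
    for ch in text:
        if current is None:
            if ch == "<":
                current = ""  # start collecting a new tag name
        elif ch == ">" or ch.isspace():
            if current:
                tags.append(current)
            current = None  # ignore everything up to the next "<"
        else:
            current += ch
    return tags


# Hypothetical, well-formed variant of the sample in the question.
sample = (
    "<html>\n"
    " <body>\n"
    " <h1>This is test</h1>\n"
    " <h2> This is test 2</h2>\n"
    " </body>\n"
    "</html>\n"
)

print(extract_tags(sample))
```

To scan a real file the same way, the contents of `doc.txt` could be read with `f.read()` and passed to the function. Because collection stops at the first whitespace, attributes such as `href="..."` are skipped.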

[Question discussion]:

Tags: python python-3.x html-parsing

[Solution 1]:

In [10]: re.findall('<(.*?)>', html)
Out[10]: ['html', 'body', 'h1', '/h1', 'h2', 'h2', '/body', '/html']
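
The session above presumes that `re` has been imported and that `html` already holds the file contents. A self-contained sketch might look like this; the `html` string here is an assumed, well-formed variant of the question's sample, and the second pattern (`names_only`) is an illustrative addition that stops at the tag name so attributes are dropped.

```python
import re

# Assumed input: a well-formed variant of the sample from the question.
# With a real file this could be: html = open('doc.txt').read()
html = """<html>
<body>
<h1>This is test</h1>
<h2> This is test 2</h2>
</body>
</html>"""

# Non-greedy match of everything between "<" and the next ">".
tags = re.findall(r"<(.*?)>", html)
print(tags)

# Variant that keeps only the tag name (with an optional leading "/"),
# so attributes such as href="..." are not captured.
names_only = re.findall(r"<(/?\w+)", '<a href="x">link</a><br/>')
print(names_only)
```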

Just use a regex or HTMLParser.
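
For the HTMLParser route, a minimal sketch using the standard library's `html.parser` module could look like the following. The class name `TagCollector` and the helper `collect_tags` are hypothetical names chosen for illustration; the answer itself does not show this code.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start and end tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored; only the tag name is kept.
        self.tags.append(tag)

    def handle_endtag(self, tag):
        # Prefix closing tags with "/" to match the format in the question.
        self.tags.append("/" + tag)


def collect_tags(html):
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


print(collect_tags("<html><body><h1>This is test</h1></body></html>"))
```

Unlike the regex, the parser lowercases tag names and discards attributes on its own, at the cost of a little more code.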

[Discussion]: