【Posted】: 2019-02-13 19:40:06
【Question】:
I want to read an HTML file and extract only the tags from it.
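- Read one character at a time from the file, ignoring everything until a "<" is reached.
- Then read one character at a time, appending each to a string until a ">" or whitespace is reached (the ">" itself can be dropped as well).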
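The two steps above can be sketched roughly as follows. This is only a minimal illustration, not a full HTML parser: the function name `extract_tags` and the sample string are made up for the example (the sample uses properly closed `</h2>` and `</html>` tags, which is an assumption about what the file was meant to contain), and comments, doctype declarations, or a ">" inside an attribute value are not handled.

```python
import io

def extract_tags(stream):
    """Collect tag names by reading the stream one character at a time."""
    tags = []
    while True:
        ch = stream.read(1)
        if ch == "":              # end of input
            break
        if ch != "<":             # step 1: ignore everything until "<"
            continue
        name = []
        while True:               # step 2: accumulate until ">" or whitespace
            ch = stream.read(1)
            if ch == "" or ch == ">" or ch.isspace():
                break
            name.append(ch)
        if name:
            tags.append("".join(name))
    return tags

# Hypothetical sample; io.StringIO stands in for an opened file such as doc.txt.
sample = "<html> <body> <h1>This is test</h1> <h2> This is test 2</h2> </body> </html>"
print(extract_tags(io.StringIO(sample)))
```

Because the function only calls `read(1)`, it should work the same way on a file object, for example `with open('doc.txt', 'r') as f: print(extract_tags(f))`. Note that it also reports the opening `h2`, which the expected output listed in the question leaves out.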
<html> <body> <h1>This is test</h1> <h2> This is test 2<h2> </body> <html>

with open('doc.txt', 'r') as f:
    all_lines = []
    # loop through all lines using f.readlines() method
    for line in f.readlines():
        new_line = []
        # this is how you would loop through each alphabet
        for chars in line:
            new_line.append(chars)
        all_lines.append(new_line)
print(all_lines)
I can loop through the text file and get a list like this:
[['', '\n'], ['', '\n'], ['', '\n'] , ['']]
But the expected output should be: [html,body,h1,/h1,/h2,/body,/html]
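For comparison, a list in this shape can also be produced with the standard library's `html.parser` module instead of a hand-written character loop. The sketch below is one possible approach, not necessarily what the question is after; `TagCollector` and `collect_tags` are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record every start and end tag in the order it is encountered."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored; only the tag name is kept.
        self.tags.append(tag)

    def handle_endtag(self, tag):
        # Prefix end tags with "/" to match the format in the question.
        self.tags.append("/" + tag)

def collect_tags(html_text):
    parser = TagCollector()
    parser.feed(html_text)
    parser.close()
    return parser.tags

print(collect_tags("<html><body><h1>This is test</h1></body></html>"))
```

`HTMLParser` converts tag names to lowercase and skips text content, so only the tag names end up in the list.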
【Discussion】:
Tags: python python-3.x html-parsing