以下是一些示例,说明如何从 XML/HTML 树中删除和解析不同类型的 HTML 元素。
关键建议:它有助于不依赖外部库并在“native python 2”中完成所有事情 /3 代码”。
这里有一些示例,说明如何使用“原生”python 执行此操作...
# (REMOVE <SCRIPT> to </script> and variations)
pattern = r'<[ ]*script.*?\/[ ]*script[ ]*>' # mach any char zero or more times
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
# (REMOVE HTML <STYLE> to </style> and variations)
pattern = r'<[ ]*style.*?\/[ ]*style[ ]*>' # mach any char zero or more times
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
# (REMOVE HTML <META> to </meta> and variations)
pattern = r'<[ ]*meta.*?>' # mach any char zero or more times
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
# (REMOVE HTML COMMENTS <!-- to --> and variations)
pattern = r'<[ ]*!--.*?--[ ]*>' # mach any char zero or more times
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
# (REMOVE HTML DOCTYPE <!DOCTYPE html to > and variations)
pattern = r'<[ ]*\![ ]*DOCTYPE.*?>' # mach any char zero or more times
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
注意:
re.IGNORECASE # is needed to match case sensitive <script> or <SCRIPT> or <Script>
re.MULTILINE # is needed to match newlines
re.DOTALL # is needed to match "special characters" and match "any character"
我已经在几个不同的 HTML 文件(包括 、 和
注意:它也确实不依赖 beautifulsoup 或任何其他外部下载的库!
希望这会有所帮助!
:)