Replacing HTML code in Python: answers

【Question title】: Replacing HTML code in Python
【Posted】: 2025-12-19 03:35:16
【Question】:

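I am using regular expressions to parse a website's source code and display news headlines in a Tkinter window. I have been told that parsing HTML with regular expressions is not the best idea, but unfortunately there is no time to change that now.

I can't seem to replace the HTML codes for special characters, such as the apostrophe (').

Currently I have the following: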

union_url = 'http://www.news.com.au/sport/rugby'

def union():
    union_string = urlopen(union_url).read()
    union_string.replace("&#8217;", "'")
    union_headline = re.findall('(?:sport/rugby/.*) >(.*)<', union_string)
    union_headline_label= Label(union_window, text = union_headline[0], font=('Times',20,'bold'),  bg = 'White', width = 85, height = 3, wraplength = 500)

This does not get rid of the HTML character codes. For example, the headline prints as

Larkham: Real worth of &#8216;Giteau&#8217;s Law&#8217;

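I have tried to find an answer without any luck. Any help is much appreciated.

【Comments】:

Are you trying to fetch the data from the HTML source, or to parse it?
Sorry, fetching the data to display on a Tkinter widget.
Have you heard of Beautiful Soup? It will make your life better... parsing HTML can be hard.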
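
To illustrate the parser-based approach suggested in the comment above without adding a third-party dependency, here is a minimal sketch that swaps in the standard library's `html.parser` in place of Beautiful Soup. It assumes Python 3.5 or later. The class name `HeadlineParser`, the `href_fragment` argument and the sample markup are all hypothetical; the real page structure of the news site is not known from this thread.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of <a> tags whose href contains a fragment (hypothetical helper)."""

    def __init__(self, href_fragment):
        # convert_charrefs=True makes the parser decode &#8217; and similar references.
        super().__init__(convert_charrefs=True)
        self.href_fragment = href_fragment
        self.headlines = []
        self._buffer = None  # list of text pieces while inside a matching <a>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.href_fragment in href:
                self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.headlines.append(text)
            self._buffer = None


# Assumed sample markup, for illustration only.
sample = ('<a href="/sport/rugby/larkham">'
          'Larkham: Real worth of &#8216;Giteau&#8217;s Law&#8217;</a>')

parser = HeadlineParser("sport/rugby/")
parser.feed(sample)
parser.close()
print(parser.headlines[0])
```

Because the parser decodes character references itself, no separate replacement step is needed before passing the headline to the Tkinter `Label`.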

Tags: python html regex replace tkinter

【Solution 1】:

You can use the callable-replacement feature of re.sub() to unescape (or remove) anything that is escaped:

>>> import re
>>> def htmlUnescape(m):
...     # &#NNNN; holds a decimal code point, so parse it in base 10, not 16
...     return unichr(int(m.group(1)))
...
>>> re.sub(r'&#(\d+);', htmlUnescape, "This is something &#8217; with an HTML-escaped character in it.")
u'This is something \u2019 with an HTML-escaped character in it.'
>>>
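
Two further points may help here. First, the question's code calls `union_string.replace(...)` without keeping the result; Python strings are immutable, so `replace` returns a new string that has to be assigned. Second, on Python 3.4 or later the standard library's `html.unescape` decodes decimal, hexadecimal and named references in one call. The following is a minimal sketch under that Python 3 assumption (the session above is Python 2); the variable names are illustrative, and the sample string is the headline quoted in the question.

```python
import html

raw = "Larkham: Real worth of &#8216;Giteau&#8217;s Law&#8217;"

# str.replace returns a new string; assign the result or it is lost.
partly_fixed = raw.replace("&#8217;", "\u2019")

# html.unescape handles every character reference in a single pass.
headline = html.unescape(raw)
print(headline)
```

The decoded `headline` can then be passed as the `text` argument of the Tkinter `Label`.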

【Comments】: