【Question title】:Error in the name while fetching it from the website
【Posted】:2020-10-02 15:18:17
【Question description】:

When I run the code and print the deals, some of the deal names come out wrong, for example

import re
import urllib.request

respData = urllib.request.urlopen(
    'https://www.mcdelivery.com.pk/pk/browse/menu.html')

resp = respData.read().decode('utf-8')

link = re.findall(r'<ul class="secondary-menu">(.*?)</ul>', str(resp))
# URLS
Urls = re.findall("href=[\"\'](.*?)[\"\']", str(link))

# remove amp from the urls
Url1 = [re.sub(r'amp;', '', item) for item in Urls]
# menu
deals = re.findall(r'<span>(.*?)</span>', str(link))
print(deals)

Code output:

['Deals', "\\\\xe2\\\\x98\\\\x85What\\\\\\'s New\\\\xe2\\\\x98\\\\x85", '\\\\xc3\\\\x80la carte & Value Meals', 'Crispy Chicken', 'Share Box', 'Happy Meals', 'Desserts', 'McCaf\\\\xc3\\\\xa9', 'Beverages', 'Side Lines', 'Snack Time']

\\xe2\\x98\\x85What\\\'s New\\xe2\\x98\\x85 should be What's New, and \\xc3\\x80la carte &amp; Value Meals should be la carte &amp; value meals

【Comments】:

  • The output shown doesn't make sense, because re.findall returns a list, which should print as a list representation (with brackets, quotes, etc.).
  • ['Deals', "\\\\xe2\\\\x98\\\\x85What\\\\\\'s New\\\\xe2\\\\x98\\\\x85", '\\\\xc3\\\\x80la carte &amp; Value Meals', 'Crispy Chicken', 'Share Box', 'Happy Meals', 'Desserts', 'McCaf\\\\xc3\\\\xa9', 'Beverages', 'Side Lines', 'Snack Time'] This is the output. I want to remove the backslashes and all of the encoding.
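
One plausible cause of the literal \x.. sequences, sketched under assumptions: somewhere the response body (or the list of matches) is passed through str(), which returns the repr with escapes instead of real text. The HTML fragment below is a made-up stand-in for the page, not fetched from the site, and html.unescape is added here only to turn &amp; back into &.

```python
import html
import re

# Hypothetical stand-in for the raw response body (bytes); no network call is made.
raw = (
    "<span>★What's New★</span>"
    "<span>Àla carte &amp; Value Meals</span>"
    "<span>McCafé</span>"
).encode("utf-8")

# str() on bytes returns the repr, so each non-ASCII byte becomes a literal \x.. escape.
as_repr = str(raw)

# decode() turns the bytes into real text, with no escapes left over.
as_text = raw.decode("utf-8")

# Run the regex on the decoded text itself (not on str(some_list)),
# then convert HTML entities such as &amp; to plain characters.
names = [html.unescape(n) for n in re.findall(r"<span>(.*?)</span>", as_text)]
print(names)
```

When findall returns a list and a single string is needed for the next regex, joining the items with "".join(...) avoids the extra quoting and backslashes that str() on a list introduces.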

Tags: python regex utf-8 urllib


【Solution 1】:

As I understand it, you want to remove any non-UTF-8 characters from the strings. Just add the following line after deals = re.findall(...)

deals = list(map(lambda line: line.decode('utf-8','ignore').encode("utf-8"), deals))

【Comments】:

  • There is an error: in <lambda> deals = list(map(lambda line: line.decode( AttributeError: 'str' object has no attribute 'decode'
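
The AttributeError arises because Python 3 str objects have no decode method; only bytes do. If the strings already contain literal \xNN sequences, one possible repair is sketched below. It assumes the input is ASCII-only with a single level of escaping, and unescape_utf8 is a hypothetical helper name, not something from the question or the answer.

```python
def unescape_utf8(text):
    """Turn literal \\xNN escape sequences back into the UTF-8 text they spell out."""
    return (
        text.encode("ascii")        # assumption: the escaped string is pure ASCII
        .decode("unicode_escape")   # each literal \xNN becomes the code point U+00NN
        .encode("latin-1")          # map those code points back to single bytes
        .decode("utf-8")            # read the byte sequence as UTF-8
    )

# Sample values shaped like the question's output, with one level of escaping.
deals = ["Deals", "\\xe2\\x98\\x85What\\'s New\\xe2\\x98\\x85", "McCaf\\xc3\\xa9"]
fixed = [unescape_utf8(d) for d in deals]
print(fixed)
```

If the backslashes are doubled, as the printed list in the question suggests, the unicode_escape step would likely have to run twice. Decoding the response bytes once and never calling str() on bytes or lists makes this repair unnecessary.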