【Question title】: regex combining groups into one string
【Posted】: 2021-06-10 06:01:08
【Question description】:

So here is a string:

[{'display_html': "<img src='/images/C_small.png'/> <b><a href='/u/11339405'>GDcheerios</a></b> achieved rank #936 on <a href='/b/1952051?m=0'>frederic - ONLYWONDER [Singing sometimes]</a> (osu!)", 'beatmap_id': 1952051, 'beatmapset_id': '807885', 'date': datetime.datetime(2021, 6, 1, 5, 17, 11, 80000), 'library': '', 'epic_factor': '1'}, {'display_html': "<img src='/images/A_small.png'/> <b><a href='/u/11339405'>GDcheerios</a></b> achieved rank #77 on <a href='/b/2401143?m=0'>Falcom Sound Team jdk - Desert After Tears [Inferno]</a> (osu!)", 'beatmap_id': 2401143, 'beatmapset_id': '1150262', 'date': datetime.datetime(2021, 6, 1, 4, 21, 3, 80000), 'library': '', 'epic_factor': '1'}]

I have a regex that pulls out the parts of it that I want:

\>(\w+)|( achieved rank .\w+ on )|m=0'>(.*? - .*?\])

The problem is that each part is stored in its own group, so when I print .group() I only get

GDcheerios

whereas what I want is

GDcheerios achieved rank #936 on frederic - ONLYWONDER [Singing sometimes]
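
The behaviour described above follows from how alternation works: each match of a pattern built from `a|b|c` comes from exactly one alternative, so only that alternative's group is filled. The sketch below illustrates this on a shortened copy of one `display_html` value; the variable names are chosen for illustration only.

```python
import re

# Shortened copy of one display_html value from the question.
sample = (
    "<b><a href='/u/11339405'>GDcheerios</a></b> achieved rank #936 on "
    "<a href='/b/1952051?m=0'>frederic - ONLYWONDER [Singing sometimes]</a> (osu!)"
)

pattern = re.compile(r"\>(\w+)|( achieved rank .\w+ on )|m=0'>(.*? - .*?\])")

# search() stops at the first match, which comes from the first alternative.
first = pattern.search(sample)
whole_match = first.group()   # full text of that one match
groups = first.groups()       # groups 2 and 3 did not take part, so they are None

# findall() returns one tuple per match; unused groups appear as empty strings.
all_matches = pattern.findall(sample)

print(whole_match)
print(groups)
for match in all_matches:
    print(match)
```

Each of the three pieces therefore arrives as a separate match, which is why a single `.group()` call can never return the whole sentence with this pattern.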

【Discussion】:

  • You need to remove the HTML tags ;)

Tags: python python-3.x regex


【Solution 1】:

Use BeautifulSoup to parse the HTML.

For example:

from bs4 import BeautifulSoup
import datetime

data = [{'display_html': "<img src='/images/C_small.png'/> <b><a href='/u/11339405'>GDcheerios</a></b> achieved rank #936 on <a href='/b/1952051?m=0'>frederic - ONLYWONDER [Singing sometimes]</a> (osu!)", 'beatmap_id': 1952051, 'beatmapset_id': '807885', 'date': datetime.datetime(2021, 6, 1, 5, 17, 11, 80000), 'library': '', 'epic_factor': '1'}, {'display_html': "<img src='/images/A_small.png'/> <b><a href='/u/11339405'>GDcheerios</a></b> achieved rank #77 on <a href='/b/2401143?m=0'>Falcom Sound Team jdk - Desert After Tears [Inferno]</a> (osu!)", 'beatmap_id': 2401143, 'beatmapset_id': '1150262', 'date': datetime.datetime(2021, 6, 1, 4, 21, 3, 80000), 'library': '', 'epic_factor': '1'}]

for i in data:
    s = BeautifulSoup(i['display_html'], 'html.parser')
    print(s.text)

Output:

 GDcheerios achieved rank #936 on frederic - ONLYWONDER [Singing sometimes] (osu!)
 GDcheerios achieved rank #77 on Falcom Sound Team jdk - Desert After Tears [Inferno] (osu!)
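
The output above keeps a leading space and the trailing " (osu!)" that the question did not ask for. If installing BeautifulSoup is not an option, the same tag stripping can be sketched with the standard library's `html.parser` instead. This is a swapped-in technique rather than the method of this answer, and the names `TextExtractor`, `visible_text` and `without_suffix` are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(markup):
    """Return the text content of an HTML fragment, stripped of outer whitespace."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts).strip()


def without_suffix(text, suffix=" (osu!)"):
    """Drop the trailing game-mode marker if it is present."""
    return text[:-len(suffix)] if text.endswith(suffix) else text


sample = (
    "<img src='/images/C_small.png'/> <b><a href='/u/11339405'>GDcheerios</a></b>"
    " achieved rank #936 on <a href='/b/1952051?m=0'>"
    "frederic - ONLYWONDER [Singing sometimes]</a> (osu!)"
)

print(without_suffix(visible_text(sample)))
```

Stripping the suffix as plain text after parsing keeps the HTML handling and the string clean-up separate, so either step can be changed without touching the other.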

【Discussion】:

    【Solution 2】:

    With the help of findall. It may not be the most elegant approach, but it works:

    import re
    
    target_string = r"[{'display_html': \"<img src='/images/C_small.png'/> <b><a href='/u/11339405'>GDcheerios</a></b> achieved rank #936 on <a href='/b/1952051?m=0'>frederic - ONLYWONDER [Singing sometimes]</a> (osu!)\", 'beatmap_id': 1952051, 'beatmapset_id': '807885', 'date': datetime.datetime(2021, 6, 1, 5, 17, 11, 80000), 'library': '', 'epic_factor': '1'}, {'display_html': \"<img src='/images/A_small.png'/> <b><a href='/u/11339405'>GDcheerios</a></b> achieved rank #77 on <a href='/b/2401143?m=0'>Falcom Sound Team jdk - Desert After Tears [Inferno]</a> (osu!)\", 'beatmap_id': 2401143, 'beatmapset_id': '1150262', 'date': datetime.datetime(2021, 6, 1, 4, 21, 3, 80000), 'library': '', 'epic_factor': '1'}]"
    
    result = re.findall(r"\>(\w+)|( achieved rank .\w+ on )|m=0'>(.*? - .*?\])", target_string)
    
    for i in range(2):
        print(result[i*3][0] + result[i*3+1][1] + result[i*3+2][2])
    

    Output:

    GDcheerios achieved rank #936 on frederic - ONLYWONDER [Singing sometimes]
    GDcheerios achieved rank #77 on Falcom Sound Team jdk - Desert After Tears [Inferno]
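
The loop above is written for exactly two entries. A minimal sketch of the same idea for any number of entries follows. It assumes, as the loop does, that every entry yields exactly three matches in the order name, rank phrase, title. The helper name `combine_matches` is hypothetical.

```python
import re

PATTERN = re.compile(r"\>(\w+)|( achieved rank .\w+ on )|m=0'>(.*? - .*?\])")


def combine_matches(text):
    """Join each run of three consecutive matches into one line."""
    # Every findall tuple has a single non-empty item, so joining the tuple
    # simply extracts that item.
    pieces = ["".join(groups) for groups in PATTERN.findall(text)]
    # Assumption: pieces arrive strictly as name, rank phrase, title.
    return ["".join(pieces[i:i + 3]) for i in range(0, len(pieces), 3)]


text = (
    "<b><a href='/u/11339405'>GDcheerios</a></b> achieved rank #936 on "
    "<a href='/b/1952051?m=0'>frederic - ONLYWONDER [Singing sometimes]</a> (osu!), "
    "<b><a href='/u/11339405'>GDcheerios</a></b> achieved rank #77 on "
    "<a href='/b/2401143?m=0'>Falcom Sound Team jdk - Desert After Tears [Inferno]</a> (osu!)"
)

for line in combine_matches(text):
    print(line)
```

If an entry ever produced fewer or more than three matches, the chunks would drift out of step, so this approach is only as reliable as that ordering assumption.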
    

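    【Discussion】:

      【Solution 3】:

      You can merge the alternatives of the regex above into a single expression. re.findall then returns a list of tuples, one tuple per match, each holding the three captured groups, and you can combine them quickly with ''.join. The only change needed is to replace each pipe (|) with .+ (or the lazy form .+?, which stops at the nearest following piece), so that all three groups are captured within one match.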

      import re

      html = """
      [{'display_html': "<img src='/images/C_small.png'/> <b><a href='/u/11339405'>GDcheerios</a></b> achieved rank #936 on <a href='/b/1952051?m=0'>frederic - ONLYWONDER [Singing sometimes]</a> (osu!)", 'beatmap_id': 1952051, 'beatmapset_id': '807885', 'date': datetime.datetime(2021, 6, 1, 5, 17, 11, 80000), 'library': '', 'epic_factor': '1'}, {'display_html': "<img src='/images/A_small.png'/> <b><a href='/u/11339405'>GDcheerios</a></b> achieved rank #77 on <a href='/b/2401143?m=0'>Falcom Sound Team jdk - Desert After Tears [Inferno]</a> (osu!)", 'beatmap_id': 2401143, 'beatmapset_id': '1150262', 'date': datetime.datetime(2021, 6, 1, 4, 21, 3, 80000), 'library': '', 'epic_factor': '1'}]
      [{'display_html': "<img src='/images/C_small.png'/> <b><a href='/u/11339405'>GDcheerios</a></b> achieved rank #936 on <a href='/b/1952051?m=0'>frederic - ONLYWONDER [Singing sometimes]</a> (osu!)", 'beatmap_id': 1952051, 'beatmapset_id': '807885', 'date': datetime.datetime(2021, 6, 1, 5, 17, 11, 80000), 'library': '', 'epic_factor': '1'}, {'display_html': "<img src='/images/A_small.png'/> <b><a href='/u/11339405'>GDcheerios</a></b> achieved rank #77 on <a href='/b/2401143?m=0'>Falcom Sound Team jdk - Desert After Tears [Inferno]</a> (osu!)", 'beatmap_id': 2401143, 'beatmapset_id': '1150262', 'date': datetime.datetime(2021, 6, 1, 4, 21, 3, 80000), 'library': '', 'epic_factor': '1'}]
      """
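      # Lazy .+? keeps each match inside one entry; a greedy .+ would run on to the last entry on the line.
      result = re.findall(r"\>(\w+).+?( achieved rank .\w+ on ).+?m=0'>(.*? - .*?\])", html)
      
      for each in result:
          print(''.join(each))
      

      Output:

      GDcheerios achieved rank #936 on frederic - ONLYWONDER [Singing sometimes]
      GDcheerios achieved rank #77 on Falcom Sound Team jdk - Desert After Tears [Inferno]
      GDcheerios achieved rank #936 on frederic - ONLYWONDER [Singing sometimes]
      GDcheerios achieved rank #77 on Falcom Sound Team jdk - Desert After Tears [Inferno]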
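
Whether the pieces are joined with a greedy `.+` or a lazy `.+?` changes the result when a single line holds more than one entry. The sketch below compares the two on a shortened, made-up line containing two entries. The hrefs are simplified placeholders, not the real ones.

```python
import re

# Two entries on one line, as in the data above (hrefs shortened for illustration).
line = (
    "<a href='/u/1'>GDcheerios</a> achieved rank #936 on "
    "<a href='/b/1?m=0'>frederic - ONLYWONDER [Singing sometimes]</a> "
    "<a href='/u/1'>GDcheerios</a> achieved rank #77 on "
    "<a href='/b/2?m=0'>Falcom Sound Team jdk - Desert After Tears [Inferno]</a>"
)

# Greedy: .+ grabs as much as it can, so the first name is paired with the
# last rank phrase and title on the line, and the first entry is lost.
greedy = re.findall(r"\>(\w+).+( achieved rank .\w+ on ).+m=0'>(.*? - .*?\])", line)

# Lazy: .+? stops at the nearest following piece, giving one match per entry.
lazy = re.findall(r"\>(\w+).+?( achieved rank .\w+ on ).+?m=0'>(.*? - .*?\])", line)

print(["".join(groups) for groups in greedy])
print(["".join(groups) for groups in lazy])
```

Because `.` does not match a newline by default, either form stays within one line. The difference only shows up when several entries share a line.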
      

      【Discussion】:
