【Question title】: Remove unwanted characters from set of strings in python
【Posted】: 2021-06-11 10:42:00
【Question】:

I am trying to clean up a set of strings to remove unwanted characters.

Input

Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . Alex Jary7 .
Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . Cam Hardie . C5
Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .
Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harker . Connor Beasley .
One Night Stand 0 0 D 34 W Jarvis . Silvestre De Sousa . 30 C1 C5
Dancinginthewoods 1 0 0 D 24 D Ivory . 14 Jamie Spencer . 30
Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew Mullen . 14

Desired output

Lethal Lunch
Muscika
Typhoon Ten
Wentworth Falls
One Night Stand
Dancinginthewoods 
Case Key

I tried

re.findall(r'([a-zA-Z ]*)\d*.*', final_df.loc[index, 'Horse'])

This removes everything from the first digit onward, but it leaves the t on the first entry. Is there a better way?

【Comments】:

    Tags: python string python-re


    【Solution 1】:

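    I would use re.split instead

    import re
    for d in data.splitlines():
        # data: the raw input as one multi-line string; keep the part before the first match
        print(re.split(r'\s+t?[0-9]\+?', d)[0])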
    
    Result
    Lethal Lunch
    Muscika
    Typhoon Ten
    Wentworth Falls
    One Night Stand
    Dancinginthewoods
    Case Key
    

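    Explanation: it splits the string wherever the given pattern matches and then takes the first part. You may want to adjust the pattern so that other markers are matched as well.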
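One possible way to adjust the pattern is sketched below. It also splits on markers such as v5+ or cp5+ when they directly follow the name. It assumes that a name never contains a token made of at most two letters followed by a digit. The names MARKER and horse_name are illustrative and not part of the original answer.

```python
import re

# Assumption: the name ends at the first whitespace that is followed by
# zero to two letters and then a digit (e.g. " 1", " t5+", " cp5+").
MARKER = re.compile(r'\s+[A-Za-z]{0,2}[0-9]')


def horse_name(line):
    """Return the text before the first marker (hypothetical helper)."""
    # maxsplit=1: only the first split point matters, so stop there.
    return MARKER.split(line, maxsplit=1)[0]


samples = [
    "Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . Alex Jary7 .",
    "Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harker . Connor Beasley .",
    "Some Horse cp5+ 0 0",  # made-up line where a two-letter marker comes first
]

for s in samples:
    print(horse_name(s))
```

A line without any marker is returned unchanged, because split then yields a single piece.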

    In pandas

    I just noticed that you seem to be using pandas. Assuming your df looks like this:

                                                   Horse
    0  Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . A...
    1  Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . ...
    2  Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .
    3  Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harke...
    4  One Night Stand 0 0 D 34 W Jarvis . Silvestre ...
    5  Dancinginthewoods 1 0 0 D 24 D Ivory . 14 Jami...
    6  Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew M...
    

    You can do

    from operator import itemgetter
    
    # split each row on the pattern and keep the first piece
    df["name"] = df.Horse.str.split(r'\s+t?[0-9]\+?').map(itemgetter(0))
    

    to get this:

                                                   Horse               name
    0  Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . A...       Lethal Lunch
    1  Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . ...            Muscika
    2  Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .        Typhoon Ten
    3  Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harke...    Wentworth Falls
    4  One Night Stand 0 0 D 34 W Jarvis . Silvestre ...    One Night Stand
    5  Dancinginthewoods 1 0 0 D 24 D Ivory . 14 Jami...  Dancinginthewoods
    6  Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew M...           Case Key
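As an alternative sketch, the same pattern can be used with Series.str.extract, which captures the name directly instead of splitting and indexing. The small DataFrame here is rebuilt from three of the question's sample rows purely for illustration.

```python
import pandas as pd

# Illustrative frame built from three of the sample rows in the question.
df = pd.DataFrame({
    "Horse": [
        "Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . Alex Jary7 .",
        "Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .",
        "Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew Mullen . 14",
    ]
})

# Lazily capture everything up to the first whitespace that is followed by
# an optional "t" and a digit. expand=False returns a Series for one group.
df["name"] = df["Horse"].str.extract(r'^(.*?)\s+t?[0-9]', expand=False)

print(df["name"].tolist())
```

Rows in which the pattern does not match come back as NaN, which makes unexpected input easy to spot.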
    
    

    【Comments】:

    • Great, thanks for adding the pandas bit, that is really helpful. Thanks
    【Solution 2】:

    Something like this should work:

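    filtered_text = list()
    
    # text is assumed to be an iterable of input lines
    for line in text:
        part = ""
        for word in line.split(" "):
            # stop at the first short token; this also cuts short name words such as "One" or "Ten"
            if len(word) <= 3:
                break
            else:
                part = str(part) + " " + str(word)
    
        part = part[1:] # skip first space
        filtered_text.append(part)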
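Because a length threshold also stops at short name words such as "One" or "Ten", a variant of the loop above could stop at the first token that is not purely alphabetic. This is only a sketch: it assumes names consist of letters only and that the first token after the name contains a digit or symbol. The helper name leading_words is hypothetical.

```python
from itertools import takewhile


def leading_words(line):
    """Join the leading tokens that consist of letters only (hypothetical helper)."""
    # str.isalpha is False for tokens such as "t5+", "1" or "0",
    # so takewhile stops at the first of them.
    return " ".join(takewhile(str.isalpha, line.split()))


text = [
    "Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . Alex Jary7 .",
    "Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .",
    "One Night Stand 0 0 D 34 W Jarvis . Silvestre De Sousa . 30 C1 C5",
]

filtered_text = [leading_words(line) for line in text]
print(filtered_text)
```

Names containing apostrophes or hyphens would be cut short by this check, so it only fits data shaped like the sample.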
    

    【Comments】:

      【Solution 3】:

      Would this be enough?

      import re

      # named lines rather than input, to avoid shadowing the built-in
      lines = [
          "Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . Alex Jary7 .",
          "Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . Cam Hardie . C5",
          "Typhoon Ten 1 0 0 D 13 R Hannon . Luke Catton7 .",
          "Wentworth Falls 1 cp5+ 0 0 C D 45 cp5+ G Harker . Connor Beasley .",
          "One Night Stand 0 0 D 34 W Jarvis . Silvestre De Sousa . 30 C1 C5",
          "Dancinginthewoods 1 0 0 D 24 D Ivory . 14 Jamie Spencer . 30",
          "Case Key 1 v3 0 0 D 13 v3 M Appleby . Andrew Mullen . 14",
      ]
      
      for line in lines:
          # the first run of letters and spaces that ends on a word boundary
          print(re.findall(r'\b[a-zA-Z ]+\b', line)[0])
      

      We basically ignore words that contain digits or odd symbols. Output:

      Lethal Lunch 
      Muscika 
      Typhoon Ten 
      Wentworth Falls 
      One Night Stand 
      Dancinginthewoods 
      Case Key 
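Each match from this pattern ends with a trailing space, because the word boundary falls just before the next token. A small sketch of how that could be trimmed is shown below. The helper leading_name and the empty-string fallback for lines without a match are assumptions, not part of the original answer.

```python
import re

NAME_RE = re.compile(r'\b[a-zA-Z ]+\b')


def leading_name(line):
    """Return the first letters-and-spaces run without trailing whitespace."""
    match = NAME_RE.search(line)
    # Assumption: fall back to an empty string when nothing matches.
    return match.group(0).rstrip() if match else ""


print(repr(leading_name("Lethal Lunch t5+ 0 0 D 10 t5+ Michael Bell . Alex Jary7 .")))
print(repr(leading_name("Muscika 1 v5+ W5+ 0 0 D 5 v5+ W5+ D O'Meara . Cam Hardie . C5")))
```

Using search instead of findall stops at the first match, which is the only one needed here.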
      

      【Comments】:
