Removing strings from a list that start with certain expressions: answers

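【Question title】: Removing strings from list that start with certain expressions
【Posted】: 2019-08-11 23:20:58
【Question description】:

I have a list of strings related to Twitter hashtags. I want to remove entire strings (words) that begin with certain prefixes.

For example: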

testlist = ['Just caught up with #FlirtyDancing. Just so cute! Loved it. ', 'After work drinks with this one @MrLukeBenjamin no dancing tonight though @flirtydancing @AshleyBanjo #FlirtyDancing pic.twitter.com/GJpRUZxUe8', 'Only just catching up and @AshleyBanjo you are gorgeous #FlirtyDancing', 'Loved working on this. Always a pleasure getting to assist the wonderful @kendrahorsburgh on @ashleybanjogram wonderful new show !! #flirtydancing pic.twitter.com/URMjUcgmyi', 'Just watching #FlirtyDancing & \n@AshleyBanjo what an amazing way to meet someone.. It made my heart all warm & fuzzy for these people! both couples meet back up.. pic.twitter.com/iwCLRmAi5n',]

I want to remove the picture URLs, the hashtags and the @mentions.

So far I have tried a few things, namely the startswith() and replace() methods.

For example:

prefixes = ['pic.twitter.com', '#', '@']
bestlist = []

for line in testlist:
    for word in prefixes:
        line = line.replace(word,"")
        bestlist.append(line)

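This seems to get rid of "pic.twitter.com", but not the run of letters and digits at the end of the URL. Those strings are dynamic and the URL ending is different every time, which is why I want to drop the whole string if it starts with that prefix.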

I also tried tokenizing everything, but replace() still does not get rid of the whole word:

import nltk

for line in testlist:
    tokens = nltk.tokenize.word_tokenize(line)
    for token in tokens:
        for word in prefixes:
            if token.startswith(word):
                token = token.replace(word,"")
                print(token)

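I am starting to lose hope in the startswith() and replace() methods, and I feel I may be barking up the wrong tree with them.

Is there a better way to approach this? How can I achieve the desired result of removing all strings that begin with #, @ and pic.twitter?

【Question discussion】:

Tags: python string data-cleaning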

【Solution 1】:

You can use a regular expression to specify the kind of words to replace, and use re.sub:

import re

testlist = ['Just caught up with #FlirtyDancing. Just so cute! Loved it. ', 'After work drinks with this one @MrLukeBenjamin no dancing tonight though @flirtydancing @AshleyBanjo #FlirtyDancing pic.twitter.com/GJpRUZxUe8', 'Only just catching up and @AshleyBanjo you are gorgeous #FlirtyDancing', 'Loved working on this. Always a pleasure getting to assist the wonderful @kendrahorsburgh on @ashleybanjogram wonderful new show !! #flirtydancing pic.twitter.com/URMjUcgmyi', 'Just watching #FlirtyDancing & \n@AshleyBanjo what an amazing way to meet someone.. It made my heart all warm & fuzzy for these people! both couples meet back up.. pic.twitter.com/iwCLRmAi5n',]
regexp = r'pic\.twitter\.com\S+|@\S+|#\S+'

res = [re.sub(regexp, '', sent) for sent in testlist]
print(*res, sep='\n')  # print one cleaned string per line

Output:

Just caught up with  Just so cute! Loved it. 
After work drinks with this one  no dancing tonight though    
Only just catching up and  you are gorgeous 
Loved working on this. Always a pleasure getting to assist the wonderful  on  wonderful new show !!  
Just watching  & 
 what an amazing way to meet someone.. It made my heart all warm & fuzzy for these people! both couples meet back up..
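
The output above still contains doubled spaces and a stray line break where tokens were removed. A minimal sketch of one way to tidy this up, assuming that collapsing every run of whitespace into a single space is acceptable; the names TOKEN_RE and strip_tokens are hypothetical and not part of the original answer:

```python
import re

# Same three alternatives as the pattern above, compiled once for reuse.
TOKEN_RE = re.compile(r'pic\.twitter\.com\S+|@\S+|#\S+')

def strip_tokens(text):
    """Remove pic.twitter.com links, @mentions and #hashtags, then tidy spaces."""
    without_tokens = TOKEN_RE.sub('', text)
    # split() with no arguments splits on any whitespace run (including '\n'),
    # so joining with ' ' collapses the gaps left by the removed tokens.
    return ' '.join(without_tokens.split())

sample = 'Only just catching up and @AshleyBanjo you are gorgeous #FlirtyDancing'
print(strip_tokens(sample))
```

Note that this also flattens the embedded newline in the last tweet, which may or may not be what you want.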

【Discussion】:

【Solution 2】:

This solution does not use regular expressions or any other imports.

prefixes = ['pic.twitter.com', '#', '@']
testlist = ['Just caught up with #FlirtyDancing. Just so cute! Loved it. ', 'After work drinks with this one @MrLukeBenjamin no dancing tonight though @flirtydancing @AshleyBanjo #FlirtyDancing pic.twitter.com/GJpRUZxUe8', 'Only just catching up and @AshleyBanjo you are gorgeous #FlirtyDancing', 'Loved working on this. Always a pleasure getting to assist the wonderful @kendrahorsburgh on @ashleybanjogram wonderful new show !! #flirtydancing pic.twitter.com/URMjUcgmyi', 'Just watching #FlirtyDancing & \n@AshleyBanjo what an amazing way to meet someone.. It made my heart all warm & fuzzy for these people! both couples meet back up.. pic.twitter.com/iwCLRmAi5n',]


def iter_tokens(line):
    for word in line.split():
        if not any(word.startswith(prefix) for prefix in prefixes):
            yield word

for line in testlist:
    row = list(iter_tokens(line))
    print(' '.join(row))

This produces the following result:

python test.py 
Just caught up with Just so cute! Loved it.
After work drinks with this one no dancing tonight though
Only just catching up and you are gorgeous
Loved working on this. Always a pleasure getting to assist the wonderful on wonderful new show !!
Just watching & what an amazing way to meet someone.. It made my heart all warm & fuzzy for these people! both couples meet back up..
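
str.startswith also accepts a tuple of prefixes, which removes the need for the inner any(...) loop. A minimal sketch of that variant; the function name drop_prefixed_words is hypothetical. Like the loop above, it splits on whitespace, so the original spacing and line breaks are not preserved.

```python
PREFIXES = ('pic.twitter.com', '#', '@')  # must be a tuple, not a list

def drop_prefixed_words(line, prefixes=PREFIXES):
    """Return line without the whitespace-separated words that start with a prefix."""
    kept = [word for word in line.split() if not word.startswith(prefixes)]
    return ' '.join(kept)

tweets = [
    'Just caught up with #FlirtyDancing. Just so cute! Loved it. ',
    'Only just catching up and @AshleyBanjo you are gorgeous #FlirtyDancing',
]
for tweet in tweets:
    print(drop_prefixed_words(tweet))
```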

【Discussion】:

【Solution 3】:

prefixes = {'pic.twitter.com', '#', '@'}  # use a set for faster lookups

def clean_tweet(tweet):
    # 'pic.twitter.com' is 15 characters long, hence the word[:15] slice.
    return " ".join(
        word for word in tweet.split()
        if word[:15] not in prefixes and word[0] not in prefixes
    )

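Alternatively, take a look at:

https://www.nltk.org/api/nltk.tokenize.html

TweetTokenizer can solve most of your problems.

【Discussion】:

【Solution 4】:

You need to match with a regular expression rather than a static string. replace does not understand regular expressions; use re.sub instead. To remove URLs like the ones you describe from a single string s, you need something like the following: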

import re
# Raw string; the character class lists the characters allowed to follow the domain.
re.sub(r'pic\.twitter\.com[a-zA-Z0-9,.\-!/()=?`*;:_{}\[\]\|~%-]*', '', s)

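To match hashtags, replies and URLs, you can either perform successive sub operations or combine all the regular expressions into a single expression. If you have many patterns, the former is preferable and should be combined with re.compile.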
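
As a rough illustration of the "successive sub operations" option, the sketch below precompiles one pattern per token type and applies them in turn. The mention and hashtag patterns (@\w+ and #\w+) and the names PATTERNS and scrub are assumptions made for this example; they are not taken from the answer, and \w+ is only an approximation of what Twitter accepts in handles and hashtags.

```python
import re

# One compiled pattern per kind of token; compiled once, reused for every tweet.
PATTERNS = [
    re.compile(r'pic\.twitter\.com/\S*'),  # picture links (assumed to contain a '/')
    re.compile(r'@\w+'),                   # @mentions (approximation)
    re.compile(r'#\w+'),                   # #hashtags (approximation)
]

def scrub(s):
    """Apply each pattern in turn, deleting whatever it matches."""
    for pattern in PATTERNS:
        s = pattern.sub('', s)
    return s

print(scrub('new show !! #flirtydancing pic.twitter.com/URMjUcgmyi'))
```

Unlike a \S+ based pattern, \w+ stops at punctuation, so a period or exclamation mark directly after a hashtag or handle is left in place. The surrounding spaces are also left untouched.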

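Note that this only matches URLs with the domain twitter.com and the subdomain pic. To match arbitrary URLs, you have to extend the regex with a suitable pattern. See also this post.

Edit: generalized the regex according to RFC 3986 and I.Am.A.Guy's comment.

【Discussion】:

Nice catch. Updated to a more robust regex.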