【Posted】: 2021-11-14 10:52:44
【Question】:
I am trying to append filtered_sentence to the list wiki_train_lst. I found that the step removing stop_words is fast, but removing common_name is slow (probably because common_name contains too many words). How can I filter out both stop_words and common_name quickly? Also, about 416,000 entries in total have to be appended to wiki_train_lst, which makes the appending process slow: how can that be optimized?
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')  # build the tokenizer once, not on every row
wiki_train_lst = []
for text in wiki_train_df.original_text:
    word_tokens = tokenizer.tokenize(text)
    #print(word_tokens)
    filtered_sentence = [w.lower() for w in word_tokens if w.lower() not in stop_words]  # remove stop words
    #filtered_sentence = [w for w in filtered_sentence if w not in common_surname_lst and w not in common_name_lst]
    filtered_sentence = [w for w in filtered_sentence if w not in common_name_lst]  # remove common names
    filtered_sentence = [w for w in filtered_sentence if w.isalpha()]  # remove non-alphabetic tokens
    wiki_train_lst.append(filtered_sentence)
    #print(filtered_sentence)
wiki_train_lst
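For reference, a minimal sketch of one way to speed this up, assuming stop_words, common_name_lst, and wiki_train_df are defined as in the question: convert the lookup lists to sets (hash-based membership tests are O(1) on average, while testing membership in a list scans the whole list for every token) and do all the filtering in a single pass. Note that list.append itself is amortized O(1), so the ~416,000 appends are unlikely to be the bottleneck; the repeated linear scans of common_name_lst almost certainly are.

from nltk.tokenize import RegexpTokenizer

# Build the lookup sets once, outside the loop.
stop_set = set(stop_words)
common_name_set = set(common_name_lst)

tokenizer = RegexpTokenizer(r'\w+')  # build once, reuse for every row

wiki_train_lst = []
for text in wiki_train_df.original_text:
    tokens = tokenizer.tokenize(text)
    # One pass instead of three: lowercase each token, then apply
    # all filter conditions together.
    filtered_sentence = [
        w for w in (t.lower() for t in tokens)
        if w.isalpha() and w not in stop_set and w not in common_name_set
    ]
    wiki_train_lst.append(filtered_sentence)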
【Discussion】:
- Does it help if you switch from a list comprehension to a generator expression? That is, change your [... for ... in ... if ...] into (... for ... in ... if ...)? This defers the computation until you materialize the result.
- One possible optimization is to use a set instead of a list, since sets have much faster membership tests.
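A quick sketch illustrating that comment (the list contents here are hypothetical, just to make the difference measurable):

import timeit

common_name_lst = ["name%d" % i for i in range(100_000)]  # hypothetical large list
common_name_set = set(common_name_lst)

# list membership scans every element; set membership is a hash lookup
print(timeit.timeit(lambda: "zzz" in common_name_lst, number=100))  # slow: linear scan
print(timeit.timeit(lambda: "zzz" in common_name_set, number=100))  # fast: hash lookup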
Tags: python-3.x list dataframe append nltk