【Posted】: 2015-10-30 07:24:53
【Problem Description】:
I have the following code that counts occurrences of given phrases in a string:
from nltk.util import ngrams
from nltk import word_tokenize
import pandas as pd
def count_words(convo, df_search):
    for i in range(len(df_search)):
        word = df_search['word'][i]  # the phrase to look for
        a = tuple(word.split(' '))   # phrase as a token tuple, so it can be compared to an ngram
        # re-tokenizes and rebuilds the full ngram list for every phrase
        print word, len([g for g in ngrams(word_tokenize(convo), n=len(a)) if g == a])
convo="I see a tall tree outside. A man is under the tall tree. Actually, there are more than one man under the tall tree"
df_search=pd.DataFrame({'word':['man','tall tree','is under the']})
count_words(convo,df_search)
The problem with this code is that it is really slow: it recomputes the ngrams from scratch for every new phrase. The phrases are dynamic, so I don't know their lengths in advance. I need help changing the code to speed it up.
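One way to avoid the repeated work is to tokenize the conversation once, group the search phrases by their token length, and build the n-gram counter a single time per length. A minimal sketch of that idea (`count_phrases` is a hypothetical name, and the regex tokenizer is a stand-in for `nltk.word_tokenize` to keep the example self-contained):

```python
import re
from collections import Counter

def count_phrases(convo, phrases):
    # Tokenize once, up front; a stand-in for nltk.word_tokenize.
    tokens = re.findall(r"\w+", convo.lower())
    # Group phrases by length so each n-gram size is computed only once.
    by_len = {}
    for phrase in phrases:
        by_len.setdefault(len(phrase.split()), []).append(phrase)
    counts = {}
    for n, group in by_len.items():
        # Build the n-gram counter a single time for this length.
        grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for phrase in group:
            counts[phrase] = grams[tuple(phrase.lower().split())]
    return counts

convo = ("I see a tall tree outside. A man is under the tall tree. "
         "Actually, there are more than one man under the tall tree")
print(count_phrases(convo, ['man', 'tall tree', 'is under the']))
# → {'man': 2, 'tall tree': 3, 'is under the': 1}
```

Lookups against the `Counter` are O(1), so adding more phrases of an existing length costs almost nothing extra.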
【Discussion】:
Tags: python string substring nltk n-gram