【Posted】: 2015-10-30 07:24:53
【Problem Description】:
I have the following code that counts occurrences of given phrases in a string:
from nltk.util import ngrams
from nltk import word_tokenize
import pandas as pd
def count_words(convo, df_search):
    for i in range(len(df_search)):
        word = df_search['word'][i]  # the phrase to look for
        a = tuple(word.split(' '))   # phrase as a token tuple, so it can be compared to an ngram
        # re-tokenizes and rebuilds the full ngram list for every phrase
        print word, len([g for g in ngrams(word_tokenize(convo), n=len(a)) if g == a])
convo="I see a tall tree outside. A man is under the tall tree. Actually, there are more than one man under the tall tree"
df_search=pd.DataFrame({'word':['man','tall tree','is under the']})
count_words(convo,df_search)
The problem with this code is that it is really slow: it recomputes the ngrams from scratch for every new phrase. The phrases are dynamic, so I don't know their lengths in advance. I need help changing the code to speed it up.
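One way to avoid the repeated work is to tokenize the conversation once, group the search phrases by their token length, and build the n-gram counter a single time per length. A minimal sketch of that idea (`count_phrases` is a hypothetical name, and the regex tokenizer is a stand-in for `nltk.word_tokenize` to keep the example self-contained):

```python
import re
from collections import Counter

def count_phrases(convo, phrases):
    # Tokenize once, up front; a stand-in for nltk.word_tokenize.
    tokens = re.findall(r"\w+", convo.lower())
    # Group phrases by length so each n-gram size is computed only once.
    by_len = {}
    for phrase in phrases:
        by_len.setdefault(len(phrase.split()), []).append(phrase)
    counts = {}
    for n, group in by_len.items():
        # Build the n-gram counter a single time for this length.
        grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for phrase in group:
            counts[phrase] = grams[tuple(phrase.lower().split())]
    return counts

convo = ("I see a tall tree outside. A man is under the tall tree. "
         "Actually, there are more than one man under the tall tree")
print(count_phrases(convo, ['man', 'tall tree', 'is under the']))
# → {'man': 2, 'tall tree': 3, 'is under the': 1}
```

Lookups against the `Counter` are O(1), so adding more phrases of an existing length costs almost nothing extra.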
【Discussion】:
Tags: python string substring nltk n-gram