根据@PolkaDot 的回答,我创建了使用 NLTK 的函数,然后创建了一些自定义代码以获得更高的准确性。
# Load the first 10,000 posts from the NLTK NPS Chat corpus. Each post is an
# XML element: its .text is the message body and its 'class' attribute is the
# dialogue-act label (e.g. whQuestion, ynQuestion) used below as the target.
posts = nltk.corpus.nps_chat.xml_posts()[:10000]
def dialogue_act_features(post):
    """Build a bag-of-words feature dict for one chat post.

    Every lower-cased token in *post* becomes a presence feature named
    ``'contains(<token>)'`` mapped to ``True`` — the feature shape
    expected by ``nltk.NaiveBayesClassifier``.

    Args:
        post: raw text of a single post/sentence.

    Returns:
        dict mapping feature names to True.
    """
    return {
        'contains({})'.format(word.lower()): True
        for word in nltk.word_tokenize(post)
    }
# Pair each post's feature dict with its dialogue-act label from the corpus.
featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
# 10% of the total data
size = int(len(featuresets) * 0.1)
# first 10% for test_set to check the accuracy, and rest 90% after the first 10% for training
train_set, test_set = featuresets[size:], featuresets[:size]
# get the classifer from the training set
classifier = nltk.NaiveBayesClassifier.train(train_set)
# to check the accuracy - 0.67
# print(nltk.classify.accuracy(classifier, test_set))
# NPS Chat labels that count as questions (wh-questions and yes/no questions).
question_types = ["whQuestion","ynQuestion"]
def is_ques_using_nltk(ques):
    """Return True if the trained classifier labels *ques* as a question.

    Classifies the sentence's dialogue act with the module-level
    ``classifier`` and checks the prediction against ``question_types``
    (whQuestion / ynQuestion).
    """
    question_type = classifier.classify(dialogue_act_features(ques))
    return question_type in question_types
然后
# Substring patterns that strongly suggest a question when found anywhere
# in the (lower-cased) input; used as a fallback when the classifier says "no".
question_pattern = ["do i", "do you", "what", "who", "is it", "why","would you", "how","is there",
"are there", "is it so", "is this true" ,"to know", "is that true", "are we", "am i",
"question is", "tell me more", "can i", "can we", "tell me", "can you explain",
"question","answer", "questions", "answers", "ask"]
# Auxiliary verbs that, when leading a sentence, usually mark a yes/no question.
helping_verbs = ["is","am","can", "are", "do", "does"]
# check with custom pipeline if still this is a question mark it as a question
def is_question(question):
    """Return True if *question* looks like a question.

    First asks the NLTK classifier (``is_ques_using_nltk``); a positive
    answer is trusted outright. Otherwise falls back to heuristics:
    1. any known question phrase occurring in the text, then
    2. per sentence, a trailing '?' or a leading helping verb.
    """
    question = question.lower().strip()
    # Trust a positive classifier verdict without further checks.
    if is_ques_using_nltk(question):
        return True
    # Heuristic 1: any known question phrase anywhere in the sentence.
    is_ques = any(pattern in question for pattern in question_pattern)
    if not is_ques:
        # Heuristic 2: there may be several sentences — split and test each.
        for sentence in question.split("."):
            if len(sentence.strip()):
                # word_tokenize will strip by default
                first_word = nltk.word_tokenize(sentence)[0]
                # a '?' ending or a leading helping verb marks a question
                if sentence.endswith("?") or first_word in helping_verbs:
                    is_ques = True
                    break
    return is_ques
你只需要使用 is_question 方法来检查传入的句子是否为问题。