【Question Title】: Finding whether a sentence is positive, neutral or negative?
【Posted】: 2025-11-23 07:10:01
【Question Description】:

I want to create a script that determines whether a sentence is positive, neutral, or negative.

Searching online, I found a Medium article showing this can be done with the NLTK library.

So I tried this code:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews


def extract_features(word_list):
    return dict([(word, True) for word in word_list])


if __name__ == '__main__':
    # Load positive and negative reviews
    positive_fileids = movie_reviews.fileids('pos')
    negative_fileids = movie_reviews.fileids('neg')

    features_positive = [(extract_features(movie_reviews.words(fileids=[f])),
                          'Positive') for f in positive_fileids]
    features_negative = [(extract_features(movie_reviews.words(fileids=[f])),
                          'Negative') for f in negative_fileids]

    # Split the data into train and test (80/20)
    threshold_factor = 0.8
    threshold_positive = int(threshold_factor * len(features_positive))
    threshold_negative = int(threshold_factor * len(features_negative))

    features_train = features_positive[:threshold_positive] + features_negative[:threshold_negative]
    features_test = features_positive[threshold_positive:] + features_negative[threshold_negative:]
    print("\nNumber of training datapoints:", len(features_train))
    print("Number of test datapoints:", len(features_test))

    # Train a Naive Bayes classifier
    classifier = NaiveBayesClassifier.train(features_train)
    print("\nAccuracy of the classifier:", nltk.classify.util.accuracy(classifier, features_test))

    print("\nTop 10 most informative words:")
    for item in classifier.most_informative_features()[:10]:
        print(item[0])

    # Sample input reviews
    input_reviews = [
    "Started off as the greatest series of all time, but had the worst ending of all time.",
    "Exquisite. 'Big Little Lies' takes us to an incredible journey with its emotional and intriguing storyline.",
    "I love *lyn 99 so much. It has the best crew ever!!",
    "The Big Bang Theory and to me it's one of the best written sitcoms currently on network TV.",
    "'Friends' is simply the best series ever aired. The acting is amazing.",
    "SUITS is smart, sassy, clever, sophisticated, timely and immensely entertaining!",
    "Cumberbatch is a fantastic choice for Sherlock Holmes-he is physically right (he fits the traditional reading of the character) and he is a damn good actor",
    "What sounds like a typical agent hunting serial killer, surprises with great characters, surprising turning points and amazing cast."
    "This is one of the most magical things I have ever had the fortune of viewing.",
    "I don't recommend watching this at all!"
    ]

    print("\nPredictions:")
    for review in input_reviews:
        print("\nReview:", review)
        probdist = classifier.prob_classify(extract_features(review.split()))
        pred_sentiment = probdist.max()
        print("Predicted sentiment:", pred_sentiment)
        print("Probability:", round(probdist.prob(pred_sentiment), 2))

Here is the output I got:

Number of training datapoints: 1600
Number of test datapoints: 400

Accuracy of the classifier: 0.735

Top 10 most informative words:
outstanding
insulting
vulnerable
ludicrous
uninvolving
avoids
astounding
fascination
affecting
seagal

Predictions:

Review: Started off as the greatest series of all time, but had the worst ending of all time.
Predicted sentiment: Negative
Probability: 0.64

Review: Exquisite. 'Big Little Lies' takes us to an incredible journey with its emotional and intriguing storyline.
Predicted sentiment: Positive
Probability: 0.89

Review: I love *lyn 99 so much. It has the best crew ever!!
Predicted sentiment: Negative
Probability: 0.51

Review: The Big Bang Theory and to me it's one of the best written sitcoms currently on network TV.
Predicted sentiment: Positive
Probability: 0.62

Review: 'Friends' is simply the best series ever aired. The acting is amazing.
Predicted sentiment: Positive
Probability: 0.55

Review: SUITS is smart, sassy, clever, sophisticated, timely and immensely entertaining!
Predicted sentiment: Positive
Probability: 0.82

Review: Cumberbatch is a fantastic choice for Sherlock Holmes-he is physically right (he fits the traditional reading of the character) and he is a damn good actor
Predicted sentiment: Positive
Probability: 1.0

Review: What sounds like a typical agent hunting serial killer, surprises with great characters, surprising turning points and amazing cast.This is one of the most magical things I have ever had the fortune of viewing.
Predicted sentiment: Positive
Probability: 0.95

Review: I don't recommend watching this at all!
Predicted sentiment: Negative
Probability: 0.53

Process finished with exit code 0

The problem I'm facing is that the dataset is very limited, so the resulting accuracy is quite low. Is there a better library, resource, or anything else for checking whether a statement is positive, neutral, or negative?

More specifically, I want to apply it to general everyday conversation.
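One simple way to get the missing "neutral" class out of a two-way classifier like the one above is to threshold the winner's probability: near-coin-flip predictions become Neutral. A minimal sketch (the 0.6 cutoff is an arbitrary illustration, not something from the code above):

```python
def label_with_neutral(prob_winner, winner, threshold=0.6):
    """Map a two-class prediction to Positive/Neutral/Negative.

    prob_winner: probability of the winning class (0.5..1.0)
    winner: 'Positive' or 'Negative'
    threshold: below this confidence, fall back to 'Neutral'
               (0.6 is an arbitrary choice; tune it on held-out data)
    """
    return winner if prob_winner >= threshold else 'Neutral'


# Applied to two of the predictions printed above:
print(label_with_neutral(0.89, 'Positive'))  # confident -> Positive
print(label_with_neutral(0.51, 'Negative'))  # near coin flip -> Neutral
```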

【Comments】:

  • There are plenty of sentiment-analysis datasets available online that you can use. Otherwise, you can scrape comments from websites or collect them with the Twitter API.
  • Thanks for pointing me to the Twitter API... figuring it out... thanks.
  • Hi, I tried VADER sentiment analysis... it gave better results than the code above... So, just want to ask: which is better, TextBlob or VADER?
  • With a small dataset of only 2000 records, it isn't really about which package or classifier is better. You can see that several of the "top 10 most informative words" your classifier learned carry no sentiment at all: "seagal" is just an actor/director's name, and words like "avoids" and "fascination" are meaningless as sentiment cues.
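For context, the rule-based idea behind VADER can be illustrated without the library: score each word against a sentiment lexicon and flip the sign after a negation word. The three-word lexicon and single negation rule below are made up purely for illustration; the real tool is nltk.sentiment.vader.SentimentIntensityAnalyzer (after nltk.download('vader_lexicon')) and is far more sophisticated.

```python
# Toy illustration of a VADER-style rule-based scorer.
# This lexicon and negation rule are invented for the sketch;
# real VADER ships a large lexicon plus punctuation/intensifier rules.
LEXICON = {'best': 3.0, 'amazing': 2.5, 'worst': -3.0}
NEGATIONS = {'not', "don't", 'never'}

def toy_sentiment(sentence):
    score, negate = 0.0, False
    for word in sentence.lower().split():
        if word in NEGATIONS:
            negate = True  # flip the sign of the next sentiment word
            continue
        if word in LEXICON:
            score += -LEXICON[word] if negate else LEXICON[word]
            negate = False
    if score > 0.5:
        return 'Positive'
    if score < -0.5:
        return 'Negative'
    return 'Neutral'

print(toy_sentiment('the best crew ever'))      # Positive
print(toy_sentiment('not the best ending'))     # Negative
print(toy_sentiment('it aired on network tv'))  # Neutral
```

Because it is rule-based, this kind of scorer needs no training data at all, which is why VADER can beat a classifier trained on only 2000 records.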

Tags: python machine-learning nlp nltk sentiment-analysis


【Solution 1】:

The Amazon Customer Reviews dataset is a huge dataset with over 130 million customer reviews. You can use it for sentiment analysis by matching reviews with ratings. That much data is also well suited to fancy, data-hungry deep learning methods.

(https://s3.amazonaws.com/amazon-reviews-pds/readme.html)
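"Matching reviews with ratings" usually means deriving sentiment labels from star ratings. The cutoffs below (4+ positive, 3 neutral, 2- negative) are a common convention, not something the dataset itself prescribes:

```python
def rating_to_label(stars):
    """Derive a sentiment label from a 1-5 star rating.

    The 4+/3/2- cutoffs are a common convention for building
    sentiment labels, not part of the Amazon dataset spec.
    """
    if stars >= 4:
        return 'Positive'
    if stars == 3:
        return 'Neutral'
    return 'Negative'

labeled = [(text, rating_to_label(stars)) for text, stars in
           [('Loved it', 5), ('It was okay', 3), ('Broke in a day', 1)]]
print(labeled)
```

This also gives you the Neutral class for free, which the pos/neg movie-review corpus lacks.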

If you are specifically after movie reviews, the Large Movie Review Dataset is also an option, with 50K+ IMDB reviews. (http://ai.stanford.edu/~amaas/data/sentiment/)
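The Large Movie Review Dataset unpacks into aclImdb/train and aclImdb/test, each with pos/ and neg/ subfolders holding one plain-text review per file. A loader for that layout might look like this:

```python
import os

def load_reviews(split_dir):
    """Read (text, label) pairs from a directory with pos/ and neg/
    subfolders, one plain-text review per file -- the layout the
    IMDB Large Movie Review Dataset uses."""
    data = []
    for label in ('pos', 'neg'):
        folder = os.path.join(split_dir, label)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding='utf-8') as f:
                data.append((f.read(),
                             'Positive' if label == 'pos' else 'Negative'))
    return data

# e.g. train = load_reviews('aclImdb/train')
```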

I would suggest enhancing your model with word embeddings instead of a one-hot-encoded bag of words.
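To make the embedding suggestion concrete: represent a review as, say, the mean of its word vectors instead of a sparse one-hot bag. The three-dimensional embedding table below is made up for illustration; real pretrained embeddings (GloVe, word2vec) have hundreds of dimensions and are loaded from a file:

```python
# Toy 3-d embedding table, invented for this sketch; real embeddings
# would be loaded from a pretrained file (GloVe, word2vec, ...).
EMB = {
    'great': [0.9, 0.1, 0.0],
    'awful': [-0.8, 0.2, 0.1],
    'movie': [0.0, 0.5, 0.5],
}
DIM = 3

def embed(words):
    """Mean-pool the vectors of the known words in a review."""
    vecs = [EMB[w] for w in words if w in EMB]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]

print(embed(['great', 'movie']))  # averages the two word vectors
```

Unlike a bag of words, this gives every review a dense fixed-length vector in which similar words land near each other, so the classifier can generalize to words it never saw with a given label.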

【Discussion】:

  • Based on what you said, which is better to start with for sentiment analysis: TextBlob, VADER, or something else?
  • It depends on your needs and where you use it in the real world. VADER uses a rule-based approach, which can outperform learning-based methods when your domain lacks data. TextBlob, on the other hand, uses a NaiveBayes classifier pretrained on movie reviews; trained on a larger dataset, it can work better than your code. So the best approach is to try different methods and pick whatever gives the best results for your requirements.
【Solution 2】:

There are already several corpora available:

English:

1) Multi-Domain Sentiment Dataset: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/

2) IMDB reviews: http://ai.stanford.edu/~amaas/data/sentiment/

3) Stanford Sentiment Treebank: http://nlp.stanford.edu/sentiment/code.html

4) Sentiment140:http://help.sentiment140.com/for-students/

5) Twitter US Airline Sentiment: https://www.kaggle.com/crowdflower/twitter-airline-sentiment

and more: 50 free Machine Learning datasets: Sentiment Analysis, and nlpprogress.

Chinese:

7) THUCNews: http://thuctc.thunlp.org/

8) Toutiao: https://github.com/fate233/toutiao-text-classfication-dataset

9) Sogou (CA): https://www.sogou.com/labs/resource/ca.php

10) Sogou (CS): https://www.sogou.com/labs/resource/cs.php

and more here.

Once your dataset is large enough, you can use a discriminative model: with small datasets, generative models help prevent overfitting, while with large datasets, discriminative models can capture dependencies that generative models cannot (see here for details).
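To make the generative/discriminative contrast concrete, here is a sketch (assuming scikit-learn is available) that trains Multinomial Naive Bayes, a generative model, and logistic regression, a discriminative one, on the same bag-of-words features; the four-sentence corpus is a stand-in for a real dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for a large labeled dataset.
texts = ['great acting and a great story', 'the best series ever aired',
         'the worst ending of all time', 'awful boring and predictable']
labels = ['Positive', 'Positive', 'Negative', 'Negative']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

generative = MultinomialNB().fit(X, labels)            # models P(words | class)
discriminative = LogisticRegression().fit(X, labels)   # models P(class | words)

test = vectorizer.transform(['a great series'])
print(generative.predict(test)[0], discriminative.predict(test)[0])
```

On four examples the two models agree; the trade-off described above only shows up as the training set grows.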

It is also said that modeling sentiment with tree structures works better when there is not much data, so the tree-structured LSTM is worth considering.

【Discussion】: