Predicting fake news or not doesn't work well with new data

【Question title】: Predicting fake news or not doesn't work well with new data
【Posted】: 2021-03-05 11:56:59
【Question description】:

I have a dataset that looks like this:

                     content                          label
0   Sainte-Nathalène – Si les scientifiques sonnen...   1
1   Le musicien américano-néerlandais Eddie Van Ha...   0
2   Angela Merkel écoute Emmanuel Macron, lors d’u...   0
3   Analyse. Telle qu’elle a été présentée, dimanc...   0
4   Sur l’esplanade du Trocadéro, à Paris, 24 août...   0

The dataset contains 1,000 fake news articles and 1,000 real ones.

I train the model like this:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['content'], df['label'], test_size=0.20)

# Random forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Vectorizing and applying TF-IDF

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', RandomForestClassifier())
])

# Fitting the model
model = pipeline.fit(X_train, y_train)
# Accuracy
from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix
prediction = model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100,2)))

accuracy: 95.95%

rf_cm = metrics.confusion_matrix(y_test, prediction)
print(rf_cm)

[[193  18]
 [  0 233]]

So the model seems to be well trained.
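
As a quick check on these numbers, the matrix can be read by hand. This is a minimal sketch that assumes scikit-learn's convention (rows are true labels, columns are predicted labels) and the 0 = real, 1 = fake mapping used in routes.py; the variable names are illustrative only.

```python
# Confusion matrix as printed above; rows = true label, columns = predicted label
# (scikit-learn's convention). Assumed mapping: 0 = real, 1 = fake.
cm = [[193, 18],
      [0, 233]]

real_as_real, real_as_fake = cm[0]
fake_as_real, fake_as_fake = cm[1]
total = real_as_real + real_as_fake + fake_as_real + fake_as_fake

# Overall accuracy: correct predictions divided by all test samples
accuracy = (real_as_real + fake_as_fake) / total
# Per-class recall: share of each true class that was labelled correctly
real_recall = real_as_real / (real_as_real + real_as_fake)
fake_recall = fake_as_fake / (fake_as_real + fake_as_fake)

print("accuracy: {}%".format(round(accuracy * 100, 2)))
print("real articles kept as real: {}%".format(round(real_recall * 100, 2)))
print("fake articles caught: {}%".format(round(fake_recall * 100, 2)))
```

Under that reading, every error on the test split is a real article labelled as fake and none the other way round, which is at least consistent with the model leaning towards "fake" on unseen articles.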

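I saved the model as model.pickle so I could use it in Flask.

When I use this model on a new article, it always predicts that the article is fake, even when the article is real.

model.py in the Flask app looks like this: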

import pickle
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

news = pd.read_csv('news2.csv')
X = news['content']
y = news['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# Random forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Vectorizing and applying TF-IDF

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', RandomForestClassifier())
])

# Fitting the model
model = pipeline.fit(X_train, y_train)

# Accuracy
from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix
prediction = model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100,2)))

rf_cm = metrics.confusion_matrix(y_test, prediction)
print(rf_cm)

# Serialize the fitted pipeline to disk
with open('model.pickle', 'wb') as handle:
    pickle.dump(pipeline, handle, protocol=pickle.HIGHEST_PROTOCOL)

In routes.py I did this:

# Receiving the input url from the user and using Web Scraping to extract the news content
@app.route('/predict', methods=['GET', 'POST'])
def predict():
    url = request.get_data(as_text=True)[5:]
    url = urllib.parse.unquote(url)
    article = Article(str(url))
    article.download()
    article.parse()
    article.nlp()
    news = article.summary
    # Passing the news article to the model and returning whether it is Fake or Real
    pred = model.predict([news])
    dic = {1:'Fake',0:'Real'}
    return render_template('home.html', prediction_text='The news is "{}"'.format(dic[pred[0]]))

What could be the reason? How can I get better results on new data with the trained model?

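【Comments】:

Does your model in Flask correctly label samples from the test data?
I updated the post with the model.py I use in the Flask app. How can I check whether the model in Flask correctly labels samples from the test data?
Sorry, I guess I wasn't clear enough: when you use Flask, does it work as expected on the training or test data?
Yes, it works well. Could the reason be that I don't have enough data to train my model?
You can't detect fake news with a bag-of-words model. You need to extract the facts and check them against a database of verified facts.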

Tags: python machine-learning scikit-learn random-forest

【Solution 1】:

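Detecting fake news is hard. It requires a lot of knowledge about the world, not just some probabilities about which words occur. A few years ago, an article titled "US president suggests nuking hurricanes" would obviously have been labelled "fake news" by many people. But today? Not so sure...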

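Your model seems to fit your dataset well. But does the dataset really represent your problem? Your model has certainly learned something, but what has it learned? Perhaps certain phrases that signal "real" news appear in the dataset but not in the website articles? Or perhaps the other way around?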

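Also, have you checked that the data is preprocessed correctly after scraping? Are there still HTML tags or similar artifacts in the data? That could also affect the classifier.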
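
A rough way to check for such leftovers is to scan the scraped text for markup before passing it to the model. The sketch below uses only the standard library; has_markup and clean_text are hypothetical helper names, and the regular expressions are a crude heuristic, not an HTML parser. It is also worth comparing like with like: routes.py passes article.summary to the model, whereas the model was trained on the content column, so a difference in length or style between the two may by itself shift the predictions.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                  # anything that looks like <tag ...>
ENTITY_RE = re.compile(r"&[a-zA-Z]+;|&#\d+;")    # e.g. &amp; or &#39;


def has_markup(text):
    """Heuristic: True if the text still contains HTML tags or entities."""
    return bool(TAG_RE.search(text) or ENTITY_RE.search(text))


def clean_text(text):
    """Strip tags, decode entities and collapse runs of whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


# Made-up scraped string for illustration
scraped = "<p>Tom &amp; Jerry<br/>are back</p>"
print(has_markup(scraped))
print(clean_text(scraped))
```

Running has_markup over the training texts as well as the scraped ones shows whether the two sides were cleaned in the same way.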

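All in all, though, I would be impressed by a model that learned to detect fake news from only 2,000 samples. Fact-checking is a hard task even for human experts!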

【Discussion】:

Thanks, so I'll look for a dataset with 40,000 articles. There are some on Kaggle. I think that will work a bit better.