[Posted]: 2021-03-05 11:56:59
[Question]:
I have a dataset that looks like this:
                                             content  label
0  Sainte-Nathalène – Si les scientifiques sonnen...      1
1  Le musicien américano-néerlandais Eddie Van Ha...      0
2  Angela Merkel écoute Emmanuel Macron, lors d’u...      0
3  Analyse. Telle qu’elle a été présentée, dimanc...      0
4  Sur l’esplanade du Trocadéro, à Paris, 24 août...      0
The data contains 1000 fake news articles and 1000 real news articles.
I train the model like this:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['content'], df['label'], test_size=0.20)
# Random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# Vectorizing and applying TF-IDF
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', RandomForestClassifier())
])
# Fitting the model
model = pipeline.fit(X_train, y_train)
# Accuracy
from sklearn import metrics
# plot_confusion_matrix was unused here and was removed in scikit-learn 1.2
from sklearn.metrics import accuracy_score, confusion_matrix
prediction = model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100,2)))
accuracy: 95.95%
rf_cm = metrics.confusion_matrix(y_test, prediction)
print(rf_cm)
[[193  18]
 [  0 233]]
So the model appears to be well trained.
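The confusion matrix above can also be read per class. The following is a minimal sketch, assuming scikit-learn's convention (rows are true labels, columns are predicted labels) and the 0 = Real, 1 = Fake mapping used later in routes.py; the variable names are illustrative only.

```python
# Confusion matrix reported above: rows = true label, columns = predicted label.
cm = [[193, 18],
      [0, 233]]

real_as_real, real_as_fake = cm[0]  # articles whose true label is 0 (Real)
fake_as_real, fake_as_fake = cm[1]  # articles whose true label is 1 (Fake)

total = real_as_real + real_as_fake + fake_as_real + fake_as_fake
accuracy = (real_as_real + fake_as_fake) / total

# Recall: share of each true class that was labeled correctly.
recall_real = real_as_real / (real_as_real + real_as_fake)
recall_fake = fake_as_fake / (fake_as_real + fake_as_fake)

# Precision for Fake: share of "Fake" predictions that were really fake.
precision_fake = fake_as_fake / (real_as_fake + fake_as_fake)

print(f"accuracy:       {accuracy:.4f}")
print(f"recall (Real):  {recall_real:.4f}")
print(f"recall (Fake):  {recall_fake:.4f}")
print(f"precision Fake: {precision_fake:.4f}")
```

All 18 test errors are real articles labeled as fake, and no fake article was labeled as real, so even on the test split the mistakes go in one direction.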
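I pickled the model as model.pickle so that I can use it in Flask.
When I use this model on a new article, it always predicts that the article is fake, even when the article is real.
The model.py in the Flask app looks like this: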
import pickle
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
news = pd.read_csv('news2.csv')
X = news['content']
y = news['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
# Random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# Vectorizing and applying TF-IDF
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', RandomForestClassifier())
])
# Fitting the model
model = pipeline.fit(X_train, y_train)
# Accuracy
from sklearn import metrics
# plot_confusion_matrix was unused here and was removed in scikit-learn 1.2
from sklearn.metrics import accuracy_score, confusion_matrix
prediction = model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100,2)))
rf_cm = metrics.confusion_matrix(y_test, prediction)
print(rf_cm)
# Serialize the fitted pipeline to disk
with open('model.pickle', 'wb') as handle:
    pickle.dump(pipeline, handle, protocol=pickle.HIGHEST_PROTOCOL)
In routes.py I did this:
# Receive the input URL from the user and scrape the news content from it
@app.route('/predict', methods=['GET', 'POST'])
def predict():
    # Raw request body minus its first 5 characters, then URL-decoded
    url = request.get_data(as_text=True)[5:]
    url = urllib.parse.unquote(url)
    article = Article(str(url))
    article.download()
    article.parse()
    article.nlp()
    news = article.summary
    # Pass the news article to the model and return whether it is Fake or Real
    pred = model.predict([news])
    dic = {1: 'Fake', 0: 'Real'}
    return render_template('home.html', prediction_text='The news is "{}"'.format(dic[pred[0]]))
What could be the reason? How can I get better results on new data with the trained model?
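One thing that may be worth checking, offered as an assumption rather than a confirmed cause: the pipeline is trained on the full `content` column, while `predict()` passes `article.summary`, which is typically a much shorter text. The sketch below uses made-up toy strings (not the real dataset) and a hypothetical helper `known_terms` to compare how many vocabulary terms a fitted `CountVectorizer` recognizes in each kind of input.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy training texts standing in for news['content'] (illustrative only).
train_texts = [
    "the city council approved the new budget after a long debate on public transport",
    "scientists publish a peer reviewed study on regional climate trends and rainfall",
    "secret miracle cure hidden by doctors shocks the entire world overnight",
    "celebrity reveals that the moon landing was staged in a secret studio",
]
vectorizer = CountVectorizer().fit(train_texts)

# Stand-ins for a full article body and for its shorter summary.
full_text = (
    "The city council approved the new budget. "
    "The debate on public transport lasted for hours. "
    "Scientists presented a study on rainfall trends."
)
summary = "The city council approved the new budget."


def known_terms(text):
    """Count the distinct training-vocabulary terms found in `text`."""
    return vectorizer.transform([text]).nnz


print("terms recognized in full text:", known_terms(full_text))
print("terms recognized in summary:  ", known_terms(summary))
```

If the summary activates far fewer features than the training articles did, passing the full body (`article.text`, available after `parse()`) instead would be a closer match to the training data.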
[Comments]:
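- Does the model in your Flask app correctly label samples from the test data?
- I updated the post with the model.py I use in the Flask app. How can I check whether the model in Flask correctly labels samples from the test data?
- Sorry, I guess I wasn't clear enough: when used through Flask, does it work as expected on the training or test data?
- Yes, it works well. Could the reason be that I don't have enough data to train my model?
- You can't detect fake news with a bag-of-words (BOW) model. You need to extract the facts and check them against a database of verified facts.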
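The check asked about in the comments can be sketched as follows: reload the pickled pipeline, confirm that it reproduces the in-memory predictions, and then look at class probabilities instead of only the final label. This is a minimal sketch on made-up toy data; the texts, the labels, the fixed `random_state`, and the in-memory `pickle.dumps` round-trip (used here in place of the `model.pickle` file) are assumptions for illustration.

```python
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy data standing in for news2.csv (1 = Fake, 0 = Real), illustrative only.
texts = [
    "aliens built the pyramids overnight says anonymous insider",
    "miracle pill cures every disease doctors hate it",
    "secret studio staged the moon landing claims celebrity",
    "parliament votes on the annual budget bill",
    "city council opens a new public library",
    "central bank keeps interest rates unchanged",
]
labels = [1, 1, 1, 0, 0, 0]

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', RandomForestClassifier(n_estimators=20, random_state=0)),
]).fit(texts, labels)

# Round-trip through pickle, as model.py and the Flask app do via a file.
loaded = pickle.loads(pickle.dumps(pipeline))

# 1) The reloaded pipeline should reproduce the original predictions.
same = bool((loaded.predict(texts) == pipeline.predict(texts)).all())
print("reloaded model matches original:", same)

# 2) Inspect class probabilities for a new text, not just the hard label.
proba = loaded.predict_proba(["city council votes on the library budget"])
print("classes:", loaded.classes_)
print("probabilities:", proba[0])
```

If the reloaded pipeline matches on held-out data but real-world articles still come back as Fake, the difference is more likely in the input passed at prediction time than in the serialization step.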
Tags: python machine-learning scikit-learn random-forest