【Question title】: How to get the label predicted using sklearn and numpy?
【Posted】: 2020-05-18 04:42:04
【Question】:

I am trying to classify some text with sklearn, using a folder in which each subfolder is a collection of txt files:

import numpy
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

path = r'C:\wamp64\www\machine_learning\webroot\mini_iniciais'

# load the dataset (one subfolder per class)
data = load_files(path, encoding="utf-8", decode_error="replace")
labels, counts = numpy.unique(data.target, return_counts=True)
labels_str = numpy.array(data.target_names)[labels]
print(dict(zip(labels_str, counts)))

# split the data and build the TF-IDF features
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)
vectorizer = TfidfVectorizer(max_features=1000, decode_error="ignore")
X_train_vectorized = vectorizer.fit_transform(X_train)

cls = MultinomialNB()
cls.fit(X_train_vectorized, y_train)

texts_to_predict = ["medicamento"]

result = cls.predict(vectorizer.transform(texts_to_predict))
print(result)

This is the output of print(dict(zip(labels_str, counts))):

{'PG16-PROCURADORIA-DE-SERVICOS-DE-SAUDE': 10, 'PP-PROCURADORIA-DE-PESSOAL-PG04': 10, 'PPMA-PROCURADORIA-DE-PATRIMONIO-E-MEIO-AMBIENTE-PG06': 10, 'PPREV-PROCURADORIA-PREVIDENCIARIA-PG07': 10, 'PSP-PROCURADORIA-DE-SERVICOS-PUBLICOS-PG08': 10, 'PTRIB-PROCURADORIA-TRIBUTARIA-PG03': 10}

The result of cls.predict is just an int in an array:

[0]

or even [1], [3], etc. when I change the value of texts_to_predict.

So, how can I get the name of one of the subfolders as the prediction result?

【Comments】:

    Tags: python numpy machine-learning scikit-learn text-classification


    【Solution 1】:

    According to the documentation of load_files, the target_names attribute of the returned data holds

    [t]he names of target classes.
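
To make the quoted description concrete, here is a minimal sketch of how the integer labels relate to the names. The list below copies the folder names from the question's printed output in the same order. The `result` array is a hypothetical prediction, not output from the asker's model.

```python
import numpy

# Folder names as printed in the question; with load_files these are
# available as data.target_names, and data.target holds indices into them.
target_names = [
    'PG16-PROCURADORIA-DE-SERVICOS-DE-SAUDE',
    'PP-PROCURADORIA-DE-PESSOAL-PG04',
    'PPMA-PROCURADORIA-DE-PATRIMONIO-E-MEIO-AMBIENTE-PG06',
    'PPREV-PROCURADORIA-PREVIDENCIARIA-PG07',
    'PSP-PROCURADORIA-DE-SERVICOS-PUBLICOS-PG08',
    'PTRIB-PROCURADORIA-TRIBUTARIA-PG03',
]

# Hypothetical predictions, in the same form cls.predict returns them.
result = numpy.array([0, 3])

# Indexing an array of names with the integer labels maps them all at once.
predicted_names = numpy.array(target_names)[result]
print(predicted_names.tolist())
```

This is the same lookup the question already performs when it builds `labels_str`, applied to predictions instead of to the unique training labels.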

    So, consider using something like

    print([data.target_names[x] for x in result])
    

    instead of

    print(result)
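
For readers who want to try this without the original data, the sketch below runs the whole flow on a toy corpus written to a temporary directory. The folder names (`saude`, `tributaria`) and the file contents are made up for illustration and are not the asker's data. The train/test split is left out to keep the example short.

```python
import os
import tempfile

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: one subfolder per class, a few txt files in each.
corpus = {
    "saude": ["medicamento hospital tratamento", "medicamento cirurgia hospital"],
    "tributaria": ["imposto tributo fiscal", "imposto taxa tributo"],
}

with tempfile.TemporaryDirectory() as root:
    for folder, texts in corpus.items():
        os.makedirs(os.path.join(root, folder))
        for i, text in enumerate(texts):
            file_path = os.path.join(root, folder, f"{i}.txt")
            with open(file_path, "w", encoding="utf-8") as fh:
                fh.write(text)
    # File contents are read into memory here, before the directory is removed.
    data = load_files(root, encoding="utf-8", decode_error="replace")

vectorizer = TfidfVectorizer()
cls = MultinomialNB()
cls.fit(vectorizer.fit_transform(data.data), data.target)

# predict returns integer class indices ...
result = cls.predict(vectorizer.transform(["medicamento"]))

# ... which data.target_names turns back into subfolder names.
predicted_names = [data.target_names[x] for x in result]
print(predicted_names)
```

The list comprehension is the line from the answer; everything before it only sets up a classifier to produce `result`.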
    

    【Comments】:
