【Question Title】: scikit-learn add training data
【Posted】: 2016-11-26 21:00:09
【Question Description】:

I am looking at the training data provided in sklearn here. According to the documentation, it contains 20 classes of documents drawn from a newsgroup collection, and it does a reasonably good job of classifying documents belonging to those categories. However, I need to add more articles for categories such as cricket, football, nuclear physics, etc.

I have prepared a set of documents for each class, e.g. sports -> cricket, cooking -> French, etc. How do I add these documents and classes in sklearn so that the interface which currently returns 20 classes will return those 20 plus the new ones? And if I need to do some training via SVM or Naive Bayes, where should I do that before adding them to the dataset?

【Question Discussion】:

  • Could you please post your code and the problem you are running into?
  • I'm not really stuck anywhere, so there is no code to show! I just want to know how to add more training data (documents with their accompanying classes) to the 20 classes of documents that sklearn already provides.

Tags: python machine-learning scipy scikit-learn


【Solution 1】:

Assuming your additional data has the following directory structure (if it doesn't, this should be your first step, since using the sklearn API to fetch the data will make your life much easier, see @987654321@):

additional_data
      |
      |-> sports.cricket
                |
                |-> file1.txt
                |-> file2.txt
                |-> ...
      |
      |-> cooking.french
                |
                |-> file1.txt
                |-> ...
       ...
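If your documents are not yet laid out this way, a small script along these lines can create the structure that `load_files` expects. The category names and document contents below are placeholders standing in for your own data, and the sketch writes into a temporary directory so it is self-contained; in practice you would point `root` at your real additional-data path:

```python
import os
import tempfile

# Hypothetical in-memory documents keyed by category name; in practice
# these would come from wherever your new articles currently live.
docs_by_category = {
    "sports.cricket": ["Report on yesterday's test match.", "Profile of a spin bowler."],
    "cooking.french": ["A recipe for coq au vin."],
}

# Temporary directory keeps the sketch self-contained; replace with
# your real /path/to/additional_data.
root = os.path.join(tempfile.mkdtemp(), "additional_data")

for category, docs in docs_by_category.items():
    cat_dir = os.path.join(root, category)
    os.makedirs(cat_dir, exist_ok=True)
    # One file per document, named file1.txt, file2.txt, ...
    for i, text in enumerate(docs, start=1):
        path = os.path.join(cat_dir, "file{}.txt".format(i))
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)

print(sorted(os.listdir(root)))
```

After this runs, `load_files(container_path=root, encoding='utf-8')` will pick up one class per subdirectory, exactly as in the tree above.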

Moving to Python, load both datasets (supposing your additional data is in the format described above and rooted at /path/to/additional_data):

import os

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions
from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets import load_files
import joblib  # sklearn.externals.joblib is deprecated; use the standalone joblib package
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import numpy as np

# Note if you have a pre-defined training/testing split in your additional data, you would merge them with the corresponding 'train' and 'test' subsets of 20news
news_data = fetch_20newsgroups(subset='all')
additional_data = load_files(container_path='/path/to/additional_data', encoding='utf-8')

# Both data objects are of type `Bunch` and therefore can be relatively straightforwardly merged

# Merge the two data files
'''
The Bunch object contains the following attributes: `dict_keys(['target_names', 'description', 'DESCR', 'target', 'data', 'filenames'])`
The interesting ones for our purposes are 'data' and 'filenames'
'''
all_filenames = np.concatenate((news_data.filenames, additional_data.filenames)) # filenames is a numpy array
all_data = news_data.data + additional_data.data # data is a standard python list

merged_data_path = '/path/to/merged_data'

'''
The 20newsgroups data has a filename a la '/path/to/scikit_learn_data/20news_home/20news-bydate-test/rec.sport.hockey/54367'
So depending on whether you want to keep the sub directory structure of the train/test splits or not, 
you would either need the last 2 or 3 parts of the path
'''
for content, f in zip(all_data, all_filenames):
    # extract sub path
    sub_path, filename = f.split(os.sep)[-2:]

    # Create output directory if not exists
    p = os.path.join(merged_data_path, sub_path)
    if not os.path.exists(p):
        os.makedirs(p)

    # Write data to file
    with open(os.path.join(p, filename), 'w', encoding='utf-8') as out_file:
        out_file.write(content)

# Now that everything is stored at `merged_data_path`, we can use `load_files` to fetch the dataset again, which now includes everything from 20newsgroups and your additional data
all_data = load_files(container_path=merged_data_path, encoding='utf-8')

'''
all_data is yet another `Bunch` object:
    * `data` contains the data
    * `target_names` contains the label names
    * `target` contains the labels in numeric format
    * `filenames` contains the paths of each individual document

thus, running a classifier over the data is straightforward
'''
vec = CountVectorizer()
X = vec.fit_transform(all_data.data)

# We want to create a train/test split for learning and evaluating a classifier (supposing we haven't created a pre-defined train/test split encoded in the directory structure)
X_train, X_test, y_train, y_test = train_test_split(X, all_data.target, test_size=0.2)

# Create & fit the MNB model
mnb = MultinomialNB()
mnb.fit(X_train, y_train)

# Evaluate Accuracy
y_predicted = mnb.predict(X_test)

print('Accuracy: {}'.format(accuracy_score(y_test, y_predicted)))

# Alternatively, the vectorisation and learning can be packaged into a pipeline and serialised for later use
pipeline = Pipeline([('vec', CountVectorizer()), ('mnb', MultinomialNB())])

# Run the vectorizer and train the classifier on all available data
pipeline.fit(all_data.data, all_data.target)

# Serialise the classifier to disk
joblib.dump(pipeline, '/path/to/model_zoo/mnb_pipeline.joblib')

# If you get some more data later on, you can deserialise the model and run them through the pipeline again
p = joblib.load('/path/to/model_zoo/mnb_pipeline.joblib')

docs_new = ['God is love', 'OpenGL on the GPU is fast']

y_predicted = p.predict(docs_new)
print('Predicted labels: {}'.format(np.array(all_data.target_names)[y_predicted]))    

【Discussion】:

  • Wow, this looks very promising. A couple of questions - what do I do with the final all_data variable? I mean, this example - scikit-learn.org/stable/tutorial/text_analytics/… - shows how to classify documents. How do I use the all_data obtained above in that setting? Secondly, the last part of your answer is a bit unclear to me; a little more explanation would be nice.
  • @AttitudeMonger I have updated my answer - could you briefly explain which part is unclear?
  • Wow, it keeps getting better and better! This deserves more than the 50 bounty, and I will award it more in the future.. :) Coming back to it, what I don't understand is, where above do I put my own data in order to test it against the existing dataset? For example, in the link I gave above, docs_new = ['God is love', 'OpenGL on the GPU is fast'] holds the text I want to find matching categories for, and I get a result for each text in the list. How is that done in your example?
  • @AttitudeMonger You want to train the model on the existing 20newsgroups data and classify new data - did I understand that correctly?
  • Hmm, no, not quite. I have the 20 existing categories. I want to add, say, 40 more categories, with corresponding documents for each. So I then have 60 categories. For any text I want to categorise, if it belongs to any of those 60 categories (including the 40 new ones I added), I should get the resulting category and its accuracy.
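To make the last point in the thread concrete: once the pipeline has been fitted on the merged corpus, prediction works exactly as in the answer above; the predicted indices simply range over the enlarged target_names list, so newly added categories come out of the same predict call as the original 20. Below is a minimal sketch with a toy three-category corpus standing in for the real merged data (in practice docs, target and target_names would come from load_files on the merged directory):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for the merged corpus: two "original" newsgroup categories
# plus one newly added category. In practice these come from
# load_files(merged_data_path, encoding='utf-8').
docs = [
    "God is love and faith",        # soc.religion.christian
    "the GPU renders OpenGL fast",  # comp.graphics
    "the bowler took five wickets", # sports.cricket (newly added)
]
target = [0, 1, 2]
target_names = ["soc.religion.christian", "comp.graphics", "sports.cricket"]

pipeline = Pipeline([("vec", CountVectorizer()), ("mnb", MultinomialNB())])
pipeline.fit(docs, target)

# Any text, whether it belongs to an old or a newly added category,
# goes through the same predict call; the numeric label indexes into
# the enlarged target_names list.
pred = pipeline.predict(["the batsman and the bowler"])
print(target_names[pred[0]])
```

With a real corpus you would fit on all 60 categories at once; there is no separate "add" step, retraining on the merged data is what extends the label set.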