How can I classify datasets? Answers

【Question title】: How can I classify datasets?
【Posted】: 2016-11-30 04:52:48
【Question description】:

How can I use the following training data to classify a new dataset as class A or class B?

            1.0  0.9  0.8  0.7  0.6  0.5  0.4  0.3  0.2  0.1  class
Dataset 1   42   13   22   324  270  96   107  93   80   228    A
Dataset 2   45   23   14   596  445  135  153  124  132  331    A
Dataset 3   42   22   16   479  407  130  150  121  128  342    A

Dataset 4   37   63   10   481  397  155  143  159  172  394    B
Dataset 5   46   18   10   387  356  127  118  129  136  359    B
Dataset 6   23   34   9    550  436  147  166  164  208  467    B

Ideally there would be an equation that separates the datasets.

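For example: if (# of 1.0) + (# of 0.9) is greater than 55, then the dataset is class A. (This particular rule may be wrong, but something along these lines.)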
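A candidate rule of this kind can be checked directly against the six training rows before trusting it. The sketch below is only an illustration: the function name `rule` and the default threshold of 55 are taken from the example sentence above, not from any confirmed solution.

```python
# Rows copied from the question: counts for 1.0, 0.9, ..., 0.1, then the class.
rows = [
    ([42, 13, 22, 324, 270,  96, 107,  93,  80, 228], "A"),  # Dataset 1
    ([45, 23, 14, 596, 445, 135, 153, 124, 132, 331], "A"),  # Dataset 2
    ([42, 22, 16, 479, 407, 130, 150, 121, 128, 342], "A"),  # Dataset 3
    ([37, 63, 10, 481, 397, 155, 143, 159, 172, 394], "B"),  # Dataset 4
    ([46, 18, 10, 387, 356, 127, 118, 129, 136, 359], "B"),  # Dataset 5
    ([23, 34,  9, 550, 436, 147, 166, 164, 208, 467], "B"),  # Dataset 6
]


def rule(counts, threshold=55):
    """Hypothetical rule from the question: A if (# of 1.0) + (# of 0.9) > threshold."""
    return "A" if counts[0] + counts[1] > threshold else "B"


# Count how many training rows the rule labels correctly.
hits = sum(rule(counts) == label for counts, label in rows)
print(f"{hits} of {len(rows)} training rows classified correctly")
```

On these six rows the rule labels only 2 correctly, which matches the caveat that it may be wrong. The same check works for any other hand-written threshold.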

【Comments】:

First idea that comes to mind: use bagging/boosting so that each of 10 classifiers votes based on the mean. stats.stackexchange.com/questions/18891/…

Tags: machine-learning classification

【Solution 1】:

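If you are familiar with classification tasks: almost any classification algorithm can handle this, for example SVM, NN, C4.5, ID3, random forest, and so on.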
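As one hedged illustration of the tree-based options in that list, a depth-1 decision tree yields a single threshold rule, which is close to the kind of "equation" the question asks for. Note that scikit-learn's `DecisionTreeClassifier` implements CART, not C4.5 or ID3, so this is a stand-in. The choices `max_depth=1` and `random_state=0` are assumptions made for this sketch.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Column names and rows copied from the question's table.
feature_names = ["1.0", "0.9", "0.8", "0.7", "0.6", "0.5", "0.4", "0.3", "0.2", "0.1"]
X = [
    [42, 13, 22, 324, 270,  96, 107,  93,  80, 228],  # Dataset 1
    [45, 23, 14, 596, 445, 135, 153, 124, 132, 331],  # Dataset 2
    [42, 22, 16, 479, 407, 130, 150, 121, 128, 342],  # Dataset 3
    [37, 63, 10, 481, 397, 155, 143, 159, 172, 394],  # Dataset 4
    [46, 18, 10, 387, 356, 127, 118, 129, 136, 359],  # Dataset 5
    [23, 34,  9, 550, 436, 147, 166, 164, 208, 467],  # Dataset 6
]
y = ["A", "A", "A", "B", "B", "B"]

# A depth-1 tree (a "stump") learns exactly one "feature <= threshold" test.
tree = DecisionTreeClassifier(max_depth=1, random_state=0)
tree.fit(X, y)

# Print the learned rule in readable form.
rule_text = export_text(tree, feature_names=feature_names)
print(rule_text)

train_accuracy = tree.score(X, y)
print("training accuracy:", train_accuracy)
```

Several single columns happen to separate these six rows (for instance, column 0.8 is at least 14 for every A row and at most 10 for every B row), so which column the stump picks can depend on tie-breaking. With only six training rows, a perfect fit here says little about how the rule behaves on new datasets.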

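But if you want a formula, take a look at logistic regression: https://en.wikipedia.org/wiki/Logistic_regression. It classifies data into two classes (for example, positive and negative).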

For an implementation, see logistic regression in the Python scikit-learn linear models: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html and here: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

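from sklearn.linear_model import LogisticRegression
X = [[...]]  # your datasets: a list of lists (matrix), one row per dataset
y = [...]    # your labels: one class per row, e.g. "A" or "B"
clf = LogisticRegression()
clf.fit(X, y)  # afterwards, clf.predict(new_rows) returns the predicted classes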

This example is also useful: http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html
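To make the "formula" aspect concrete, here is a minimal sketch that fits logistic regression on the six rows from the question and writes the result as a linear rule on the raw counts. The `StandardScaler` step, `max_iter=1000`, and the names `w_raw` and `b_raw` are assumptions added for this illustration; they are not part of the answer above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Columns, in order: counts for 1.0, 0.9, ..., 0.1 (copied from the question).
X = np.array([
    [42, 13, 22, 324, 270,  96, 107,  93,  80, 228],  # Dataset 1
    [45, 23, 14, 596, 445, 135, 153, 124, 132, 331],  # Dataset 2
    [42, 22, 16, 479, 407, 130, 150, 121, 128, 342],  # Dataset 3
    [37, 63, 10, 481, 397, 155, 143, 159, 172, 394],  # Dataset 4
    [46, 18, 10, 387, 356, 127, 118, 129, 136, 359],  # Dataset 5
    [23, 34,  9, 550, 436, 147, 166, 164, 208, 467],  # Dataset 6
], dtype=float)
y = np.array(["A", "A", "A", "B", "B", "B"])

# Scaling first helps the solver converge on features of very different size.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

scaler = model.named_steps["standardscaler"]
clf = model.named_steps["logisticregression"]

# Fold the scaling back into the weights so the rule applies to raw counts:
# score(x) = w_raw . x + b_raw, and score > 0 means clf.classes_[1] ("B").
w_raw = clf.coef_[0] / scaler.scale_
b_raw = clf.intercept_[0] - np.dot(w_raw, scaler.mean_)

scores = X @ w_raw + b_raw
print("class predicted when score > 0:", clf.classes_[1])
print("weights:", np.round(w_raw, 4))
print("intercept:", round(float(b_raw), 4))
print("training predictions:", model.predict(X))
```

The printed weights and intercept form the equation: multiply each count by its weight, add the intercept, and assign class B when the result is positive, class A otherwise. With six rows and ten features the fit is heavily underdetermined, so the coefficients should be treated as a sketch rather than a reliable rule.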

【Discussion】:

【Solution 2】:

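You can also use Naive Bayes to predict the class of a dataset. Naive Bayes gives you a probability for each class, so in your example you might get, for instance, a 70% probability of class A and 30% of class B for Dataset 1.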

Based on your example, use the class column as the label column and the columns 0.1 and 0.9 as feature columns.

It is easy to run with your data; for this run I encoded A == 1 and B == 2.
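The answer does not show its code, so the following is only a sketch of what such a run could look like with scikit-learn. It assumes the Gaussian variant (`GaussianNB`) and uses all ten count columns as features; both choices are assumptions, and the probabilities it prints are not claimed to match the 70%/30% figures mentioned above.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Columns, in order: counts for 1.0, 0.9, ..., 0.1 (copied from the question).
X = np.array([
    [42, 13, 22, 324, 270,  96, 107,  93,  80, 228],  # Dataset 1
    [45, 23, 14, 596, 445, 135, 153, 124, 132, 331],  # Dataset 2
    [42, 22, 16, 479, 407, 130, 150, 121, 128, 342],  # Dataset 3
    [37, 63, 10, 481, 397, 155, 143, 159, 172, 394],  # Dataset 4
    [46, 18, 10, 387, 356, 127, 118, 129, 136, 359],  # Dataset 5
    [23, 34,  9, 550, 436, 147, 166, 164, 208, 467],  # Dataset 6
], dtype=float)
y = np.array([1, 1, 1, 2, 2, 2])  # encoding from the answer: A == 1, B == 2

nb = GaussianNB()
nb.fit(X, y)

# predict_proba returns one row per dataset; columns follow nb.classes_ ([1, 2]).
proba = nb.predict_proba(X)
for i, (p_a, p_b) in enumerate(proba, start=1):
    print(f"Dataset {i}: P(A)={p_a:.3f}  P(B)={p_b:.3f}")
```

To score a new dataset, pass its ten counts as a single row to `nb.predict_proba`. Because each class has only three training rows, the estimated per-feature variances are rough, and the resulting probabilities can be overconfident.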

【Discussion】: