Answers: Variability/randomness of Support Vector Machine model scores in Python's scikit-learn

【Question title】: Variability/randomness of Support Vector Machine model scores in Python's scikit-learn
【Posted】: 2021-05-05 16:41:45
【Question】:

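I am testing several ML classification models, in this case Support Vector Machines. I have a basic understanding of the SVM algorithm and how it works.

I am using the built-in breast cancer dataset from scikit-learn.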

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

I fit the models with the following code:

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, 
                                                    stratify=cancer.target, random_state=42)
clf2 = LinearSVC(C=0.01).fit(X_train, y_train)
clf3 = LinearSVC(C=0.1).fit(X_train, y_train)
clf4 = LinearSVC(C=1).fit(X_train, y_train)
clf5 = LinearSVC(C=10).fit(X_train, y_train)
clf6 = LinearSVC(C=100).fit(X_train, y_train)

and then print the scores:

print("Model training score with C=0.01:\n{:.3f}".format(clf2.score(X_train, y_train)))
print("Model testing score with C=0.01:\n{:.3f}".format(clf2.score(X_test, y_test)))
print("------------------------------")
print("Model training score with C=0.1:\n{:.3f}".format(clf3.score(X_train, y_train)))
print("Model testing score with C=0.1:\n{:.3f}".format(clf3.score(X_test, y_test)))
print("------------------------------")
print("Model training score with C=1:\n{:.3f}".format(clf4.score(X_train, y_train)))
print("Model testing score with C=1:\n{:.3f}".format(clf4.score(X_test, y_test)))
print("------------------------------")
print("Model training score with C=10:\n{:.3f}".format(clf5.score(X_train, y_train)))
print("Model testing score with C=10:\n{:.3f}".format(clf5.score(X_test, y_test)))
print("------------------------------")
print("Model training score with C=100:\n{:.3f}".format(clf6.score(X_train, y_train)))
print("Model testing score with C=100:\n{:.3f}".format(clf6.score(X_test, y_test)))

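When I run this code, I get certain scores for each value of the regularization parameter C. When I run the .fit lines again (that is, train the models again), the scores come out completely different. Sometimes the difference is large (for example, 72% versus 90% for the same value of C).

Where does this variability come from? I assumed that, since I use the same random_state parameter, the model would always find the same support vectors and therefore give me the same results. That is evidently not the case, because the scores change when I train the model again. With logistic regression, for example, the scores are always consistent no matter how many times I rerun the fit code.

An explanation of this variability in the accuracy scores would be very helpful!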

【Comments】:

Tags: python machine-learning scikit-learn svm

【Answer 1】:

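Sure. You need to set random_state to a specific seed on each LinearSVC model, instead of leaving it at None, in order to reproduce the results. The random_state=42 in your code is passed only to train_test_split, not to the models.

Otherwise you are using the default random_state=None, so a random seed is used every time you call the command, and that is why you get this variability.

Use: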

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, 
                                                    stratify=cancer.target, random_state=42)
clf2 = LinearSVC(C=0.01,random_state=42).fit(X_train, y_train)
clf3 = LinearSVC(C=0.1, random_state=42).fit(X_train, y_train)
clf4 = LinearSVC(C=1,   random_state=42).fit(X_train, y_train)
clf5 = LinearSVC(C=10,  random_state=42).fit(X_train, y_train)
clf6 = LinearSVC(C=100, random_state=42).fit(X_train, y_train)
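
One way to check that a fixed seed really makes the fit repeatable is to train the same model twice and compare the learned coefficients. The following is only a minimal sketch: the helper name fit_coef is made up for illustration, and dual=True is passed explicitly as an assumption, because the scikit-learn documentation states that random_state only affects LinearSVC when the dual coordinate descent solver is used.

```python
import warnings

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42
)


def fit_coef(seed):
    """Hypothetical helper: fit LinearSVC with a given seed, return its weights."""
    with warnings.catch_warnings():
        # The unscaled features may trigger a ConvergenceWarning; hide it here.
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        # Assumption: dual=True, so that random_state actually influences the solver.
        clf = LinearSVC(C=1, dual=True, random_state=seed).fit(X_train, y_train)
    return clf.coef_.copy()


coef_a = fit_coef(42)
coef_b = fit_coef(42)

# Two fits with the same seed should produce exactly the same coefficients.
same_seed_identical = np.array_equal(coef_a, coef_b)
print("Same seed gives identical coefficients:", same_seed_identical)
```

If the comparison holds, rerunning the .fit lines with random_state=42 should give the same scores each time.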

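【Discussion】:

That makes sense, thanks! But when I create similar models with different C values for logistic regression, why don't I have to specify the random state of each model?
You do. But if you get the same results, that means the results and estimates are invariant to the C parameter. In other settings, the results will never be the same unless you fix the random seed.
So the random_state in the model parameters determines which pseudo-random number is used to start the coordinate descent, and it has to be the same across all models if I want to compare them, whereas the random_state in the train_test_split function controls how the train/test data are randomly drawn. Is that correct?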
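
To illustrate the second half of that last comment, here is a small sketch of what the random_state passed to train_test_split controls. The helper name split and the second seed value 0 are arbitrary choices made for this example. The seed only decides which rows end up in the training and test sets and has no effect on how a model is fitted afterwards.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()


def split(seed):
    """Hypothetical helper: stratified train/test split with a given seed."""
    return train_test_split(
        cancer.data, cancer.target, stratify=cancer.target, random_state=seed
    )


X_train_a, X_test_a, y_train_a, y_test_a = split(42)
X_train_b, X_test_b, y_train_b, y_test_b = split(42)
X_train_c, X_test_c, y_train_c, y_test_c = split(0)

# Same seed: the same rows are drawn into the training set.
same_split = np.array_equal(X_train_a, X_train_b)
# Different seed: a different set of rows is drawn.
different_split = not np.array_equal(X_train_a, X_train_c)

print("Same seed, same split:", same_split)
print("Different seed, different split:", different_split)
```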