[Posted]: 2021-05-05 16:41:45
[Problem description]:
I am testing several ML classification models, in this case a support vector machine. I have a basic understanding of the SVM algorithm and how it works.
I am using the built-in breast cancer dataset from scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
Using the code below:
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target,
stratify=cancer.target, random_state=42)
clf2 = LinearSVC(C=0.01).fit(X_train, y_train)
clf3 = LinearSVC(C=0.1).fit(X_train, y_train)
clf4 = LinearSVC(C=1).fit(X_train, y_train)
clf5 = LinearSVC(C=10).fit(X_train, y_train)
clf6 = LinearSVC(C=100).fit(X_train, y_train)
When printing the scores:
for C, clf in zip([0.01, 0.1, 1, 10, 100], [clf2, clf3, clf4, clf5, clf6]):
    print("Model training score with C={}:\n{:.3f}".format(C, clf.score(X_train, y_train)))
    print("Model testing score with C={}:\n{:.3f}".format(C, clf.score(X_test, y_test)))
    print("------------------------------")
When I run this code, I get a certain score for each regularization parameter C. But when I run the .fit lines again (i.e., retrain the models), the scores come out completely different. Sometimes they even differ substantially (e.g., 72% vs. 90% for the same value of C).
Where does this variability come from? I assumed that, since I use the same random_state parameter, it would always find the same support vectors and therefore give me the same results, but because the scores change when I retrain the models, that is apparently not the case. In logistic regression, for example, the scores are always consistent no matter how many times I rerun the fit code.
An explanation of this variability in the accuracy scores would be much appreciated!
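For reference, a minimal sketch of the hypothesis in the question: the random_state passed to train_test_split only fixes the data split, while LinearSVC's solver has its own random_state parameter (it shuffles the training data internally). Fixing that seed as well should make repeated fits reproducible; the C=1 and max_iter values below are illustrative choices, not from the original post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

cancer = load_breast_cancer()
# random_state here only makes the train/test split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Fit the same model twice with the *solver's* seed also fixed;
# the two fits are now deterministic and give identical scores.
a = LinearSVC(C=1, random_state=0, max_iter=10000).fit(X_train, y_train)
b = LinearSVC(C=1, random_state=0, max_iter=10000).fit(X_train, y_train)
print(a.score(X_test, y_test) == b.score(X_test, y_test))
```

Without random_state on LinearSVC itself, each .fit call can start from a different shuffle, which is consistent with the run-to-run differences described above.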
[Comments]:
Tags: python machine-learning scikit-learn svm