Finding data points close to the decision boundary of a classifier: answers

【Question title】: Finding data points close to the decision boundary of a classifier
【Posted】: 2020-07-08 12:17:45
【Question description】:

Sorry, this is a very basic question, but I am new to this field.

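My specific question: I have trained an XGBoost classifier in Python. After training, how can I retrieve the samples in the training data that lie closer to the model's decision boundary than some fixed value?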

Thanks

【Question comments】:

Tags: python classification xgboost

【Solution 1】:

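I don't think xgboost has a built-in method for this, or that there is a closed-form mathematical formula like the one available for SVC. For a two-dimensional feature space, this visualization may help: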

import numpy as np
import xgboost as xgb
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot the samples of each class with its own marker and color
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, color=cmap(idx),
                    marker=markers[idx], label=cl)

    # highlight test samples with hollow circles
    if test_idx is not None:
        X_test = X[test_idx, :]
        plt.scatter(X_test[:, 0],
                    X_test[:, 1],
                    facecolors='none',
                    edgecolors='black',
                    alpha=1.0,
                    linewidths=1,
                    marker='o',
                    s=55, label='test set')

X, y = make_moons(noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

xgb_clf = xgb.XGBClassifier()
xgb_clf = xgb_clf.fit(X_train, y_train)

plot_decision_regions(X_test, y_test, xgb_clf)
plt.show()

The plot_decision_regions function comes from the book Python Machine Learning and is available in the book's public GitHub repository.

【Comments】:

Thanks. Yes, I am aware of visualization in 2D. I was just wondering whether there is a way to identify samples close to the boundary in general.
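
One possible proxy that works in any number of dimensions is to look at predicted class probabilities instead of geometric distance: a sample whose two most likely classes have nearly equal probability sits near the boundary in the model's own terms. The sketch below is only an illustration under that assumption. The helper name near_boundary_indices, the max_gap parameter, and the example probabilities are all hypothetical and not part of xgboost.

```python
import numpy as np

def near_boundary_indices(proba, max_gap=0.1):
    """Return indices of rows whose two largest class probabilities
    differ by at most max_gap (hypothetical helper, not an xgboost API).

    proba: array of shape (n_samples, n_classes), e.g. from predict_proba.
    """
    proba = np.asarray(proba, dtype=float)
    # Sort each row ascending so the two largest values are the last two columns.
    top_two = np.sort(proba, axis=1)[:, -2:]
    gap = top_two[:, 1] - top_two[:, 0]
    return np.flatnonzero(gap <= max_gap)

# Made-up probabilities standing in for the output of predict_proba
example_proba = np.array([[0.95, 0.05],
                          [0.52, 0.48],
                          [0.40, 0.60],
                          [0.10, 0.90]])
near = near_boundary_indices(example_proba, max_gap=0.25)
print(near)  # [1 2]
```

With the classifier from the answer, the input could come from xgb_clf.predict_proba(X_train), and X_train[near] would then give the selected samples. Note that this measures closeness in probability space, not Euclidean distance to the boundary in feature space, so the threshold is not directly comparable to a distance.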