Support vector machine from scratch: answers

【Question title】: Support vector machine from scratch
【Posted】: 2021-03-08 06:19:33
【Question description】:

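I am trying to build a linear SVC from scratch. I used some references from MIT course 6.034 and a few YouTube videos. I was able to get the code to run, but the results don't look right. I can't figure out what I did wrong, and it would be great if someone could point out my mistake. If I understand correctly, the hinge loss is convex, so it should have a single global minimum and I should expect the cost to decrease monotonically. Instead, it clearly fluctuates toward the end.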

#Generating data
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
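X, y =  make_blobs(n_samples=300, n_features=2, centers=2, cluster_std=1,
                   random_state=42)
# Assumed preprocessing (not shown in the post): make_blobs returns labels 0/1 with
# shape (n,), while the hinge loss needs labels in {-1, +1} and the J[0] indexing
# below implies y is a column vector
y = np.where(y == 0, -1, 1).reshape(-1, 1)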


# Propose a model ==> (w,b) initialize a random plane
np.random.seed(42)
w = np.random.randn(2,1)
b = np.random.randn(1,1)

# Get output using the proposed model ==> distance score
def cal_score(point_v,label):
    return label * (point_v @ w + b)
s = cal_score(X,y)

# Evaluate performance of the initial model ==> Hinge Loss
def cal_hinge(score):
    hinge_loss = 1 - score
    hinge_loss[hinge_loss < 0] = 0  # clip at zero: max(0, 1 - score)
    cost = 0.5* sum(w**2)  + sum(hinge_loss)/len(y)
    return hinge_loss, cost

_, J = cal_hinge(s)
loss = [J[0]]
print('Cost of initial model: {}'.format(J[0]))

#Gradient descent, update (w,b)
def cal_grad(point_v,label):
    hinge, _ = cal_hinge(cal_score(point_v,label))
    grad_w = np.zeros(w.shape)
    grad_b = np.zeros(b.shape)
    for i, h in enumerate(hinge):
        if h == 0:
            grad_w +=  w
        else:
            grad_w += w - (X[i] * y[i]).reshape(-1,1)
            grad_b += y[i]
            
    return grad_w/len(X), grad_b/len(X)

grad_w,grad_b = cal_grad(X,y)
w = w - 0.03*grad_w
b = b - 0.03*grad_b

# Re-evaluation after 1-step gradient descent
s = cal_score(X,y)
_, J = cal_hinge(s)
print('Cost of 1-step model: {}'.format(J[0]))
loss.append(J[0])

#How about 30 steps:
for i in range(28):
    grad_w,grad_b = cal_grad(X,y)
    w = w - 0.04*grad_w
    b = b - 0.03*grad_b
    s = cal_score(X,y)
    _, J = cal_hinge(s)
    loss.append(J[0])
    print('Cost of {}-step model: {}'.format(i+2,J[0]))
    
    
print('Final model: w = {}, b = {}'.format(w,b))

Output

Cost of initial model: 0.13866202810721154
Cost of 1-step model: 0.13150688874177027
Cost of 2-step model: 0.12273179526491895
Cost of 3-step model: 0.11480467935989988
Cost of 4-step model: 0.1075336912554962
Cost of 5-step model: 0.10084006850825472
Cost of 6-step model: 0.09467250631773037
Cost of 7-step model: 0.08898976153627648
Cost of 8-step model: 0.08375382447902188
Cost of 9-step model: 0.07892966542038939
Cost of 10-step model: 0.07448500096528701
Cost of 11-step model: 0.07039007873679798
Cost of 12-step model: 0.06662137485152193
Cost of 13-step model: 0.0631641256490808
Cost of 14-step model: 0.06007003664049003
Cost of 15-step model: 0.05743247238207012
Cost of 16-step model: 0.05547068741404436
Cost of 17-step model: 0.05381989797841767
Cost of 18-step model: 0.05248657667528307
Cost of 19-step model: 0.051457041091025085
Cost of 20-step model: 0.050775749386560806
Cost of 21-step model: 0.0502143321989
Cost of 22-step model: 0.04964305284192223
Cost of 23-step model: 0.04934419897947399
Cost of 24-step model: 0.04918626712575319
Cost of 25-step model: 0.048988709405470836
Cost of 26-step model: 0.048964173310432575
Cost of 27-step model: 0.04890689234556096
Cost of 28-step model: 0.04901146890814169
Cost of 29-step model: 0.04882640882453289
Final model: w = [[ 0.21833245]
 [-0.16428035]], b = [[0.65908854]]

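【Question comments】:

The final result looks like what you expected: apart from step 28, the cost decreases monotonically.
I compared the performance of my code with someone else's. I think the main issue is that I use batch gradient descent while others use SGD. I'm not sure why SGD produces a better-performing model. This image shows the main difference between the two approaches. user-images.githubusercontent.com/66216181/…

Tags: python machine-learning svm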

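【Solution 1】:

Your implementation appears to be correct. The increase is so small (about 1e-4, at step 28) that you need not worry about it.

The cost goes up when the learning rate times the gradient "overshoots" the optimum. In this example the overshoot is tiny, so I wouldn't worry about it.

If you're curious why the cost increases, first ask why it shouldn't. The negative gradient points in the direction that locally decreases the loss. But if the learning rate is large enough, a step can carry us past the optimum and leave us with a higher cost. That is essentially what your code does, only on a negligible scale.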
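
The overshoot described above can be reproduced on a one-dimensional toy problem. The sketch below is only an illustration, not the asker's SVM: it minimises f(x) = |x| (which has a kink like the hinge loss) with a fixed step size, and the function name `subgradient_descent_abs` and all parameter values are made up for this example.

```python
def subgradient_descent_abs(x0, lr, steps):
    """Minimise f(x) = |x| with a fixed step size; return f after every step."""
    x = x0
    costs = [abs(x)]
    for _ in range(steps):
        # Subgradient of |x|: sign(x), taking 0 exactly at the kink
        g = (x > 0) - (x < 0)
        x = x - lr * g
        costs.append(abs(x))
    return costs

costs = subgradient_descent_abs(x0=1.0, lr=0.3, steps=8)
# Indices of the steps after which the cost is higher than before
increases = [k for k in range(1, len(costs)) if costs[k] > costs[k - 1]]
print(costs)
print("steps where the cost went up:", increases)
```

Once x gets within one step length of the minimum, each update jumps to the other side of it, so the cost alternates instead of decreasing. A smaller or decaying step size shrinks these jumps.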

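【Comments】:

In this case the separation seems acceptable, but it gets worse if I try other random seeds. I've attached an image comparing my result with the expected one. user-images.githubusercontent.com/66216181/…
I can't be sure what causes the difference, but I have a couple of guesses. First, you could let your algorithm run for more iterations so it has more time to train on the data. Second, you could tune your hyperparameter (the learning rate); maybe that will help. If neither works, let me know.
If you're interested, here is the colab link. My original code uses batch gradient descent, and I've also included someone else's SGD code. SGD seems to perform much better, although since the hinge loss is convex I would expect both methods to end up with similar parameters. Not sure what went wrong. colab.research.google.com/drive/…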
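
One way to test the expectation that batch gradient descent and SGD should land in a similar place is to run both on the same objective with the same subgradient code. The following is a rough sketch under stated assumptions, not the code from the colab: the data are synthetic NumPy blobs standing in for `make_blobs`, labels are in {-1, +1}, the score uses the `X @ w + b` convention with a matching bias gradient, and the names `cost`, `subgrad`, `train` and the `lr0 / sqrt(t)` step schedule are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian blobs with labels in {-1, +1} (synthetic stand-in for make_blobs)
X = np.vstack([rng.normal([-2.0, 2.0], 1.0, size=(150, 2)),
               rng.normal([2.0, -2.0], 1.0, size=(150, 2))])
y = np.concatenate([-np.ones(150), np.ones(150)])

def cost(w, b):
    """0.5*||w||^2 + mean hinge loss, the same form of objective as in the question."""
    margins = y * (X @ w + b)
    return 0.5 * (w @ w) + np.maximum(0.0, 1.0 - margins).mean()

def subgrad(w, b, idx):
    """Subgradient of the objective, estimated on the samples in idx."""
    Xi, yi = X[idx], y[idx]
    active = yi * (Xi @ w + b) < 1.0           # samples violating the margin
    gw = w - (Xi[active] * yi[active][:, None]).sum(axis=0) / len(idx)
    gb = -yi[active].sum() / len(idx)          # d/db of 1 - y*(x.w + b) is -y
    return gw, gb

def train(batch_size, epochs=50, lr0=0.1):
    w, b, t = np.zeros(2), 0.0, 0
    for _ in range(epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            t += 1
            gw, gb = subgrad(w, b, order[start:start + batch_size])
            lr = lr0 / np.sqrt(t)              # decaying step size
            w, b = w - lr * gw, b - lr * gb
    return w, b

w_batch, b_batch = train(batch_size=len(y))    # full batch: one update per epoch
w_sgd, b_sgd = train(batch_size=1)             # SGD: one update per sample
print("batch GD cost:", cost(w_batch, b_batch))
print("SGD cost:     ", cost(w_sgd, b_sgd))
```

For the same number of epochs, SGD performs many more parameter updates than full-batch descent, which is one plausible reason it can look better after a short run. How close the two end up depends on the step schedule and the number of updates.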

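【Solution 2】:

I figured out what was wrong. When computing the hinge loss I should have used X @ w - b instead of X @ w + b; the sign determines how the bias term is updated during gradient descent. In my experience with this code, SGD usually makes better progress than batch GD and needs less hyperparameter tuning. In theory, though, batch GD with a decaying learning rate should perform no worse than SGD.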
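
The sign issue can be checked numerically. The gradient code in the question adds `+y[i]` to `grad_b` for every sample with a nonzero hinge loss. That matches the derivative of `1 - y * (x @ w - b)` with respect to `b`, whereas for `1 - y * (x @ w + b)` the derivative is `-y`. The sketch below compares a finite-difference derivative under both conventions; the toy data and the helper names `mean_hinge` and `numeric_db` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.array([1.0] * 12 + [-1.0] * 8)   # labels in {-1, +1}, mean 0.2
w = np.zeros(2)                          # at w = 0, b = 0 every sample violates the margin
b = 0.0

def mean_hinge(b, sign):
    """Mean hinge loss with score y * (X @ w + sign * b)."""
    return np.maximum(0.0, 1.0 - y * (X @ w + sign * b)).mean()

def numeric_db(sign, eps=1e-6):
    """Central finite difference of the mean hinge loss with respect to b."""
    return (mean_hinge(b + eps, sign) - mean_hinge(b - eps, sign)) / (2 * eps)

# Bias gradient as accumulated in the question: +y[i] per violating sample, averaged
question_grad_b = y.mean()

print("d/db with X @ w + b:", numeric_db(+1.0))
print("d/db with X @ w - b:", numeric_db(-1.0))
print("question's grad_b:  ", question_grad_b)
```

Under the `+ b` convention, the question's `grad_b` has the opposite sign of the true derivative, so the update `b = b - lr * grad_b` moves the bias uphill. Either switching the score to `X @ w - b` or negating `grad_b` makes the two consistent.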

【Comments】: