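Why does average loss go up when training with Vowpal Wabbit? (Answers)

[Question title]: Why average loss goes up when training using Vowpal Wabbit
[Posted]: 2015-11-03 05:18:35
[Question description]:

I'm trying to use VW to train a regression model on a small set of examples (about 3112). I think I'm doing it correctly, yet it shows strange results. I dug around but didn't find anything helpful.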

$ cat sh600000.feat | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.040000   0.040000            1         1.0  -0.2000   0.0000       79
0.051155   0.062310            2         2.0   0.2000  -0.0496       79
0.046606   0.042056            4         4.0   0.4100   0.1482       79
0.052160   0.057715            8         8.0   0.0200   0.0021       78
0.064936   0.077711           16        16.0  -0.1800   0.0547       77
0.060507   0.056079           32        32.0   0.0000   0.3164       79
0.136933   0.213358           64        64.0  -0.5900  -0.0850       79
0.151692   0.166452          128       128.0   0.0700   0.0060       79
0.133965   0.116238          256       256.0   0.0900  -0.0446       78
0.179995   0.226024          512       512.0   0.3700  -0.0217       79
0.109296   0.038597         1024      1024.0   0.1200  -0.0728       79
0.579360   1.049425         2048      2048.0  -0.3700  -0.0084       79
0.485389   0.485389         4096      4096.0   1.9600   0.3934       79 h
0.517748   0.550036         8192      8192.0   0.0700   0.0334       79 h

finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506


$ wc model
      41      48     657 model

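Questions:

1. Why is the number of features in the output (readable) model smaller than the actual number of features? I counted 78 features in the training data (79 with the bias, as shown during training). The number of feature bits is 24, which should be far more than enough to avoid collisions.
2. Why does the average loss actually go up during training, as the example above shows?
3. (Minor) I tried increasing the number of feature bits to 32, and it output an empty model. Why?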

EDIT:

As suggested, I tried shuffling the input file and using --holdout_off. The result is still almost the same: the average loss goes up.

$ cat sh600000.feat.shuf | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache --holdout_off
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.040000   0.040000            1         1.0  -0.2000   0.0000       79
0.051155   0.062310            2         2.0   0.2000  -0.0496       79
0.046606   0.042056            4         4.0   0.4100   0.1482       79
0.052160   0.057715            8         8.0   0.0200   0.0021       78
0.071332   0.090504           16        16.0   0.0300   0.1203       79
0.043720   0.016108           32        32.0  -0.2200  -0.1971       78
0.142895   0.242071           64        64.0   0.0100  -0.1531       79
0.158564   0.174232          128       128.0   0.0500  -0.0439       79
0.150691   0.142818          256       256.0   0.3200   0.1466       79
0.197050   0.243408          512       512.0   0.2300  -0.0459       79
0.117398   0.037747         1024      1024.0   0.0400   0.0284       79
0.636949   1.156501         2048      2048.0   1.2500  -0.0152       79
0.363364   0.089779         4096      4096.0   0.1800   0.0071       79
0.477569   0.591774         8192      8192.0  -0.4800   0.0065       79
0.411068   0.344567        16384     16384.0   0.0700   0.0450       77

finished run
number of examples per pass = 3112
passes used = 10
weighted example sum = 31120
weighted label sum = -105.5
average loss = 0.423404
best constant = -0.0033901
total feature number = 2451800

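The training examples are all distinct from one another, so I doubt this is an over-fitting problem (which, as I understand it, usually happens when the number of inputs is too small compared to the number of features).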

EDIT2:

I tried printing the average loss once per pass over the examples, and it mostly stays constant.

$ cat dist/sh600000.feat | vw --l1 1e-8 --l2 1e-8 -f dist/model -P 3112 --passes 10 -b 24 --cache_file dist/cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.498822   0.498822         3112      3112.0   0.0800   0.0015       79 h
0.476677   0.454595         6224      6224.0  -0.2200  -0.0085       79 h
0.466413   0.445856         9336      9336.0   0.0200  -0.0022       79 h
0.490221   0.561506        12448     12448.0   0.0700  -0.1113       79 h

finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506

Another try without the --l1, --l2 and -b parameters:

$ cat dist/sh600000.feat | vw -f dist/model -P 3112 --passes 10 --cache_file dist/cache
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.520286   0.520286         3112      3112.0   0.0800  -0.0021       79 h
0.488581   0.456967         6224      6224.0  -0.2200  -0.0137       79 h
0.474247   0.445538         9336      9336.0   0.0200  -0.0299       79 h
0.496580   0.563450        12448     12448.0   0.0700  -0.1727       79 h
0.533413   0.680958        15560     15560.0  -0.1700   0.0322       79 h
0.524531   0.480201        18672     18672.0  -0.9800  -0.0573       79 h

finished run
number of examples per pass = 2801
passes used = 7
weighted example sum = 19608
weighted label sum = -212.58
average loss = 0.491739 h
best constant = -0.0108415
total feature number = 1544713

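Does this mean it is normal for the average loss to increase within a single pass, and that things are fine as long as multiple passes end up with the same loss?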
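The two loss columns in the progress table can be cross-checked against each other. "average loss" is the running mean over all examples seen so far, while "since last" is the mean over only the examples since the previous printed row. The following is a minimal Python sketch of that relationship, using rows copied from the --holdout_off run above; the helper name `reconstruct` is made up for illustration and is not part of vw.

```python
# (example counter, "average loss", "since last") copied from the
# --holdout_off progress table.
rows = [
    (512, 0.197050, 0.243408),
    (1024, 0.117398, 0.037747),
    (2048, 0.636949, 1.156501),
    (4096, 0.363364, 0.089779),
    (8192, 0.477569, 0.591774),
    (16384, 0.411068, 0.344567),
]


def reconstruct(prev, cur):
    """Rebuild the running average of `cur` from the previous row.

    Hypothetical helper: weights the previous average by the examples it
    covered and the "since last" value by the examples added since then.
    """
    n_prev, avg_prev, _ = prev
    n_cur, _, since_last = cur
    return (avg_prev * n_prev + since_last * (n_cur - n_prev)) / n_cur


max_error = 0.0
for prev, cur in zip(rows, rows[1:]):
    rebuilt = reconstruct(prev, cur)
    max_error = max(max_error, abs(rebuilt - cur[1]))
    print(f"{cur[0]:>6}  printed={cur[1]:.6f}  rebuilt={rebuilt:.6f}")

print(f"largest mismatch: {max_error:.1e}")
```

Read this way, the jump to 0.636949 at example 2048 comes entirely from the examples between 1025 and 2048, whose mean loss was 1.156501. A few large errors in one window are enough to pull the running average up.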

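[Question comments]:

See truf's complete and correct answer. Just two small comments: with about 100 unique features, there is no need to use -b 24. Even in the worst case, where feature names never repeat (3000 examples x 79 features per example), you would have 237000 unique features, which is fewer than the default vw weight-vector space: 2^18 = 262144 slots. Also, there is no need for an extra cat process; you can pass the data file directly to vw. HTH.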
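The arithmetic in this comment is easy to verify. The short Python sketch below reproduces it; the variable names are chosen here for illustration, and the figures (3000 examples, 79 features, 18 default bits, 24 requested bits) are the ones quoted in the thread.

```python
# Worst case from the comment: no feature name ever repeats.
examples = 3000
features_per_example = 79
worst_case_unique = examples * features_per_example

# Size of the weight table for the default -b 18 and for the -b 24 used above.
default_slots = 2 ** 18
slots_b24 = 2 ** 24

print("worst-case unique features:", worst_case_unique)
print("slots with -b 18:", default_slots)
print("slots with -b 24:", slots_b24)
print("fits in default table:", worst_case_unique < default_slots)
```

With only 79 distinct features in the actual data, the default table is already far larger than needed.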

Tags: vowpalwabbit

[Solution 1]:

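1. The model file stores only non-zero weights. So most likely the others were zeroed out, especially if you are using --l1.
2. This can have many causes. Perhaps your dataset isn't shuffled well enough. If the dataset is sorted so that examples labeled -1 are in the first half and examples labeled 1 are in the second half, the model will show very good convergence on the first half, but you will see the average loss jump once it reaches the second half. So it may be an imbalance in the dataset. As for the last two losses: these are holdout losses (marked with an "h" at the end of the line) and may indicate that the model is overfitted. Please refer to my other answer.
3. Well, in the master branch the use of -b 32 is currently even blocked. You should use up to -b 31. In practice, -b 24-28 is usually enough even for tens of thousands of features.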

I would recommend getting the latest vw version from GitHub.
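The sorted-dataset effect described in point 2 can be reproduced with a toy online learner. The sketch below is not vw's algorithm: it assumes a deliberately trivial model that predicts the mean of the labels seen so far, and it scores each example before learning from it, which is how a progressive loss is measured. The function names are made up for illustration.

```python
def progressive_losses(labels):
    """Squared loss of each example, measured before the model sees its label.

    Toy model (an assumption for illustration): predict the mean of all
    labels seen so far, or 0.0 before any example has been seen.
    """
    total = 0.0
    losses = []
    for seen, y in enumerate(labels):
        prediction = total / seen if seen else 0.0
        losses.append((prediction - y) ** 2)
        total += y  # "learn" from the example only after scoring it
    return losses


def running_average(values):
    """Cumulative mean, analogous to the 'average loss' column."""
    averages = []
    running_sum = 0.0
    for count, value in enumerate(values, start=1):
        running_sum += value
        averages.append(running_sum / count)
    return averages


# Sorted dataset: all -1 labels first, then all +1 labels.
sorted_labels = [-1.0] * 50 + [1.0] * 50
avg_sorted = running_average(progressive_losses(sorted_labels))

print("average loss after first half: ", avg_sorted[49])  # 0.02
print("average loss after second half:", avg_sorted[-1])
```

The first half looks almost perfectly converged, and the average loss then climbs as soon as the second half begins, even though nothing is wrong with the learner itself.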

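[Comments]:

On (2): note that by default vw holds out one example in every ten to estimate the error across multiple passes, and you may have been "unlucky" in that your largest outlier (judging only from the progress output; I may be wrong), the one with label 1.9600, was selected into the holdout set. If you train on small labels and test on large ones (on average), then the more passes you add, the larger the error gets. Try adding --holdout_off (or changing --holdout_period <N> to a number that is relatively prime to the number of examples) to get a smaller (not necessarily better) training loss.
@truf, my input is not sorted. But to be safe, I reshuffled it after seeing your answer and tried again. The average loss still goes up.
I also disabled holdout as @arielf suggested. That didn't fix it either. It would be great if you could suggest other possibilities for me to check. I can share the dataset if needed (and if possible).
I'm not sure it is really increasing. Could you take the number of examples in your dataset (from number of examples per pass = N) and add -P N to the command line? In that case vw will print the average loss once per pass (instead of every 2^n examples as usual). Does it still increase? In any case, I would play with the -l parameter to get better convergence. (Update: by -l I don't mean --l1 or --l2.)
Thanks for the suggestion, @truf. Updated, see EDIT2.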
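The holdout numbers in this thread can be checked with a few lines of Python. This is only a sketch under the assumption stated in the comment above, namely that one example in every holdout period is set aside. The variable names are made up, and the exact indices vw holds out are not modelled here.

```python
from math import gcd

n_examples = 3112     # dataset size reported with --holdout_off
holdout_period = 10   # default period, per the comment above

# Assumption: one example per full period is held out.
held_out = n_examples // holdout_period
trained = n_examples - held_out
print("held out per pass: ", held_out)
print("trained on per pass:", trained)

# The comment suggests a period that is relatively prime to the dataset size.
shared_factor = gcd(n_examples, holdout_period)
coprime_periods = [p for p in range(2, 16) if gcd(n_examples, p) == 1]
print("gcd(3112, 10) =", shared_factor)
print("coprime candidate periods below 16:", coprime_periods)
```

The training count matches the "number of examples per pass = 2801" line of the last run above. The default period of 10 shares a factor of 2 with 3112, whereas a period such as 9 or 11 does not.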