【Question Title】: My sklearn_crfsuite model does not learn anything
【Posted】: 2020-07-12 07:52:04
【Question Description】:

I am trying to create an annotation prediction model following the tutorial here, but my model does not learn anything. Here is a sample of my training data and labels:

```
[{'bias': 1.0,
  'word.lower()': '\nreference\nissue\ndate\ndgt86620\n4\n \n19-dec-05\nfalcon\n7x\ntype\ncertification\n27_4-100\nthis\ndocument\nis\nthe\ntellectual\nprop...nairbrake\nhandle\nposition\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n0\ntable\n1\n:\nairbrake\ncas\nmmessages\n',
  'word[-3:]': 'es\n', 'word[-2:]': 's\n',
  'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False,
  'postag': 'POS', 'postag[:2]': 'PO',
  'w_emb_0': 0.03418987928976114, 'w_emb_1': 0.6173382811066742,
  'w_emb_2': 0.004420982990809508, 'w_emb_3': 0.08293022662242588,
  'w_emb_4': 0.22162269482070363, 'w_emb_5': 0.4334545347397811,
  'w_emb_6': 0.7844891779932379, 'w_emb_7': 0.028043262790094503,
  'w_emb_8': 0.5233847386564157, 'w_emb_9': 0.9685677133128328,
  'w_emb_10': 0.19379126558708126, 'w_emb_11': 0.2809608896964926,
  'w_emb_12': 0.384759230815804, 'w_emb_13': 0.15385904662767336,
  'w_emb_14': 0.5206500040610533, 'w_emb_15': 0.009148526006733215,
  'w_emb_16': 0.5894118695171416, 'w_emb_17': 0.7356989708459056,
  'w_emb_18': 0.5576774100159024, 'w_emb_19': 0.2185294430010376,
  'BOS': True,
  '+1:word.lower()': 'reference', '+1:word.istitle()': False,
  '+1:word.isupper()': True, '+1:postag': 'POS', '+1:postag[:2]': 'PO'},
 {'bias': 1.0, 'word.lower()': 'reference',
  'word[-3:]': 'NCE', 'word[-2:]': 'CE',
  'word.isupper()': True, 'word.istitle()': False, 'word.isdigit()': False,
  'postag': 'POS', 'postag[:2]': 'PO',
  'w_emb_0': -0.390038, 'w_emb_1': 0.30677223, 'w_emb_2': -1.010975,
  'w_emb_3': 0.3656154, 'w_emb_4': 0.5319459, 'w_emb_5': 0.45572615,
  'w_emb_6': -0.46090943, 'w_emb_7': 0.87250936, 'w_emb_8': 0.036648277,
  'w_emb_9': -0.3057043, 'w_emb_10': 0.33427167, 'w_emb_11': -0.19664396,
  'w_emb_12': -0.64899784, 'w_emb_13': -0.1785065, 'w_emb_14': -0.117423356,
  'w_emb_15': 0.16247013, 'w_emb_16': 0.11694676, 'w_emb_17': -0.30693895,
  'w_emb_18': -1.0026807, 'w_emb_19': 0.9946743,
  '-1:word.lower()': '\nreference...n \n \n \n \n \n \n \n \n0\ntable\n1\n:\nairbrake\ncas\nmessages\n',
  '-1:word.istitle()': False, '-1:word.isupper()': False,
  '-1:postag': 'POS', '-1:postag[:2]': 'PO',
  '+1:word.lower()': 'issue', '+1:word.istitle()': False,
  '+1:word.isupper()': True, '+1:postag': 'POS', '+1:postag[:2]': 'PO'},
 {'bias': 1.0, 'word.lower()': 'issue',
  'word[-3:]': 'SUE', 'word[-2:]': 'UE',
  'word.isupper()': True, 'word.istitle()': False, 'word.isdigit()': False,
  'postag': 'POS', 'postag[:2]': 'PO',
  'w_emb_0': -1.2204882, 'w_emb_1': 0.8920707, 'w_emb_2': -3.8380668,
  'w_emb_3': 1.5641377, 'w_emb_4': 2.1918254, 'w_emb_5': 1.8509868,
  'w_emb_6': -2.0664182, 'w_emb_7': 3.1591077, 'w_emb_8': -0.33126026,
  'w_emb_9': -1.4278139, 'w_emb_10': 0.9291533, 'w_emb_11': -0.6761407,
  'w_emb_12': -2.9582167, 'w_emb_13': -0.5395561, 'w_emb_14': -0.8363763,
  'w_emb_15': 0.25568742, 'w_emb_16': 0.4932978, 'w_emb_17': -1.6198335,
  'w_emb_18': -4.183924, 'w_emb_19': 4.281094,
  '-1:word.lower()': 'reference', '-1:word.istitle()': False,
  '-1:word.isupper()': True, '-1:postag': 'POS', '-1:postag[:2]': 'PO',
  '+1:word.lower()': 'date', '+1:word.istitle()': False,
  '+1:word.isupper()': True, '+1:postag': 'POS', '+1:postag[:2]': 'PO'}...]

y_train = ['O', 'O', 'O'...'I-data-c-a-s_message-type'....'B-data-c-a-s_message-type']
```
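Feature dicts of this shape are usually produced by a per-token helper function. Below is a minimal sketch of such a helper (the function name `word2features` and the `(word, postag)` sentence format are assumptions based on the feature keys above; the `w_emb_*` embedding features are omitted for brevity):

```python
def word2features(sent, i):
    """Build a feature dict for token i of a sentence of (word, postag) pairs."""
    word, postag = sent[i]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        # Context features from the previous token
        prev_word, prev_postag = sent[i - 1]
        features.update({
            '-1:word.lower()': prev_word.lower(),
            '-1:word.istitle()': prev_word.istitle(),
            '-1:word.isupper()': prev_word.isupper(),
            '-1:postag': prev_postag,
            '-1:postag[:2]': prev_postag[:2],
        })
    else:
        features['BOS'] = True  # beginning of sequence
    if i < len(sent) - 1:
        # Context features from the next token
        next_word, next_postag = sent[i + 1]
        features.update({
            '+1:word.lower()': next_word.lower(),
            '+1:word.istitle()': next_word.istitle(),
            '+1:word.isupper()': next_word.isupper(),
            '+1:postag': next_postag,
            '+1:postag[:2]': next_postag[:2],
        })
    else:
        features['EOS'] = True  # end of sequence

    return features

sent = [('REFERENCE', 'POS'), ('ISSUE', 'POS'), ('DATE', 'POS')]
X_sent = [word2features(sent, i) for i in range(len(sent))]
```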

Here is the model definition and training:

```python
import sklearn_crfsuite
from sklearn_crfsuite import metrics

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

y_pred = crf.predict(X_test)
# Sort labels by entity name first, then by B-/I- prefix
sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))

msg = metrics.flat_classification_report(y_test, y_pred, labels=sorted_labels, digits=4)
print(msg)
```
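Before calling `crf.fit()`, it is worth sanity-checking the shapes of the data: `sklearn_crfsuite.CRF` expects `X` as a list of sequences (each a list of feature dicts) and `y` as a list of matching label sequences, and a tiny training set will not learn much. A quick diagnostic sketch, using made-up placeholder data (the variable names match the snippet above):

```python
from collections import Counter

def describe_split(X_train, y_train, X_test, y_test):
    """Return train/test token counts and the training label distribution."""
    n_train_tokens = sum(len(seq) for seq in X_train)
    n_test_tokens = sum(len(seq) for seq in X_test)
    label_counts = Counter(lab for seq in y_train for lab in seq)
    return n_train_tokens, n_test_tokens, label_counts

# Tiny made-up example: 2 training sequences (5 tokens), 1 test sequence (4 tokens)
X_train = [[{'bias': 1.0}] * 3, [{'bias': 1.0}] * 2]
y_train = [['O', 'B-x', 'I-x'], ['O', 'O']]
X_test = [[{'bias': 1.0}] * 4]
y_test = [['O'] * 4]

print(describe_split(X_train, y_train, X_test, y_test))
```

If the test set turns out much larger than the training set, as happened here, that imbalance shows up immediately.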

Unfortunately, my model does not learn anything:

                           precision    recall  f1-score   support   
B-data-c-a-s_message-type     0.0000    0.0000    0.0000        23  
I-data-c-a-s_message-type     0.0000    0.0000    0.0000        90
                micro avg     0.0000    0.0000    0.0000       113
                macro avg     0.0000    0.0000    0.0000       113
             weighted avg     0.0000    0.0000    0.0000       113

【Question Comments】:

    Tags: python scikit-learn crfsuite


    【Solution 1】:

    The problem is solved. As you can see, the support (the number of evaluation samples) totals 113, but my training set only contained about 14 samples, which is far too small! I simply hadn't noticed the discrepancy. I swapped the training and test datasets, and now the performance looks like this:

                                precision    recall  f1-score   support
    B-data-c-a-s_message-type     0.0000    0.0000    0.0000         0
    I-data-c-a-s_message-type     0.6364    1.0000    0.7778        14
                    micro avg     0.6364    1.0000    0.7778        14
                    macro avg     0.3182    0.5000    0.3889        14
                 weighted avg     0.6364    1.0000    0.7778        14
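    Rather than hand-swapping two fixed sets, a more robust fix is to split the labeled sequences with scikit-learn's `train_test_split`, so most of the data lands in training. A sketch with placeholder data (`sentences_X` and `sentences_y` stand in for the real feature and label sequences):

    ```python
    from sklearn.model_selection import train_test_split

    # Placeholder sequences standing in for the real annotated data
    sentences_X = [[{'bias': 1.0}] for _ in range(10)]
    sentences_y = [['O'] for _ in range(10)]

    # Hold out 20% for evaluation; fix random_state for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(
        sentences_X, sentences_y, test_size=0.2, random_state=42
    )
    ```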
    

    【Comments】:
