【Question Title】: My sklearn_crfsuite model does not learn anything
【Posted】: 2020-07-12 07:52:04
【Question Description】:

I am trying to create an annotation prediction model following the tutorial here, but my model does not learn anything. Here is a sample of my training data and labels:

```
[{'bias': 1.0,
  'word.lower()': '\nreference\nissue\ndate\ndgt86620\n4\n \n19-dec-05\nfalcon\n7x\ntype\ncertification\n27_4-100\nthis\ndocument\nis\nthe\ntellectual\nprop...nairbrake\nhandle\nposition\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n0\ntable\n1\n:\nairbrake\ncas\nmmessages\n',
  'word[-3:]': 'es\n', 'word[-2:]': 's\n',
  'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False,
  'postag': 'POS', 'postag[:2]': 'PO',
  'w_emb_0': 0.03418987928976114, 'w_emb_1': 0.6173382811066742,
  'w_emb_2': 0.004420982990809508, 'w_emb_3': 0.08293022662242588,
  'w_emb_4': 0.22162269482070363, 'w_emb_5': 0.4334545347397811,
  'w_emb_6': 0.7844891779932379, 'w_emb_7': 0.028043262790094503,
  'w_emb_8': 0.5233847386564157, 'w_emb_9': 0.9685677133128328,
  'w_emb_10': 0.19379126558708126, 'w_emb_11': 0.2809608896964926,
  'w_emb_12': 0.384759230815804, 'w_emb_13': 0.15385904662767336,
  'w_emb_14': 0.5206500040610533, 'w_emb_15': 0.009148526006733215,
  'w_emb_16': 0.5894118695171416, 'w_emb_17': 0.7356989708459056,
  'w_emb_18': 0.5576774100159024, 'w_emb_19': 0.2185294430010376,
  'BOS': True,
  '+1:word.lower()': 'reference', '+1:word.istitle()': False,
  '+1:word.isupper()': True, '+1:postag': 'POS', '+1:postag[:2]': 'PO'},
 {'bias': 1.0, 'word.lower()': 'reference',
  'word[-3:]': 'NCE', 'word[-2:]': 'CE',
  'word.isupper()': True, 'word.istitle()': False, 'word.isdigit()': False,
  'postag': 'POS', 'postag[:2]': 'PO',
  'w_emb_0': -0.390038, 'w_emb_1': 0.30677223, 'w_emb_2': -1.010975,
  'w_emb_3': 0.3656154, 'w_emb_4': 0.5319459, 'w_emb_5': 0.45572615,
  'w_emb_6': -0.46090943, 'w_emb_7': 0.87250936, 'w_emb_8': 0.036648277,
  'w_emb_9': -0.3057043, 'w_emb_10': 0.33427167, 'w_emb_11': -0.19664396,
  'w_emb_12': -0.64899784, 'w_emb_13': -0.1785065, 'w_emb_14': -0.117423356,
  'w_emb_15': 0.16247013, 'w_emb_16': 0.11694676, 'w_emb_17': -0.30693895,
  'w_emb_18': -1.0026807, 'w_emb_19': 0.9946743,
  '-1:word.lower()': '\nreference...n \n \n \n \n \n \n \n \n0\ntable\n1\n:\nairbrake\ncas\nmessages\n',
  '-1:word.istitle()': False, '-1:word.isupper()': False,
  '-1:postag': 'POS', '-1:postag[:2]': 'PO',
  '+1:word.lower()': 'issue', '+1:word.istitle()': False,
  '+1:word.isupper()': True, '+1:postag': 'POS', '+1:postag[:2]': 'PO'},
 {'bias': 1.0, 'word.lower()': 'issue',
  'word[-3:]': 'SUE', 'word[-2:]': 'UE',
  'word.isupper()': True, 'word.istitle()': False, 'word.isdigit()': False,
  'postag': 'POS', 'postag[:2]': 'PO',
  'w_emb_0': -1.2204882, 'w_emb_1': 0.8920707, 'w_emb_2': -3.8380668,
  'w_emb_3': 1.5641377, 'w_emb_4': 2.1918254, 'w_emb_5': 1.8509868,
  'w_emb_6': -2.0664182, 'w_emb_7': 3.1591077, 'w_emb_8': -0.33126026,
  'w_emb_9': -1.4278139, 'w_emb_10': 0.9291533, 'w_emb_11': -0.6761407,
  'w_emb_12': -2.9582167, 'w_emb_13': -0.5395561, 'w_emb_14': -0.8363763,
  'w_emb_15': 0.25568742, 'w_emb_16': 0.4932978, 'w_emb_17': -1.6198335,
  'w_emb_18': -4.183924, 'w_emb_19': 4.281094,
  '-1:word.lower()': 'reference', '-1:word.istitle()': False,
  '-1:word.isupper()': True, '-1:postag': 'POS', '-1:postag[:2]': 'PO',
  '+1:word.lower()': 'date', '+1:word.istitle()': False,
  '+1:word.isupper()': True, '+1:postag': 'POS', '+1:postag[:2]': 'PO'}...]

y_train = ['O', 'O', 'O'...'I-data-c-a-s_message-type'....'B-data-c-a-s_message-type']
```
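Feature dicts of this shape are usually produced by a per-token helper function. Below is a minimal sketch of such a helper (the function name `word2features` and the `(word, postag)` sentence format are assumptions based on the feature keys above; the `w_emb_*` embedding features are omitted for brevity):

```python
def word2features(sent, i):
    """Build a feature dict for token i of a sentence of (word, postag) pairs."""
    word, postag = sent[i]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        # Context features from the previous token
        prev_word, prev_postag = sent[i - 1]
        features.update({
            '-1:word.lower()': prev_word.lower(),
            '-1:word.istitle()': prev_word.istitle(),
            '-1:word.isupper()': prev_word.isupper(),
            '-1:postag': prev_postag,
            '-1:postag[:2]': prev_postag[:2],
        })
    else:
        features['BOS'] = True  # beginning of sequence
    if i < len(sent) - 1:
        # Context features from the next token
        next_word, next_postag = sent[i + 1]
        features.update({
            '+1:word.lower()': next_word.lower(),
            '+1:word.istitle()': next_word.istitle(),
            '+1:word.isupper()': next_word.isupper(),
            '+1:postag': next_postag,
            '+1:postag[:2]': next_postag[:2],
        })
    else:
        features['EOS'] = True  # end of sequence

    return features

sent = [('REFERENCE', 'POS'), ('ISSUE', 'POS'), ('DATE', 'POS')]
X_sent = [word2features(sent, i) for i in range(len(sent))]
```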

Here is the model definition and training:

```python
import sklearn_crfsuite
from sklearn_crfsuite import metrics

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

y_pred = crf.predict(X_test)
# Sort labels by entity name first, then by B-/I- prefix
sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))

msg = metrics.flat_classification_report(y_test, y_pred, labels=sorted_labels, digits=4)
print(msg)
```
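Before calling `crf.fit()`, it is worth sanity-checking the shapes of the data: `sklearn_crfsuite.CRF` expects `X` as a list of sequences (each a list of feature dicts) and `y` as a list of matching label sequences, and a tiny training set will not learn much. A quick diagnostic sketch, using made-up placeholder data (the variable names match the snippet above):

```python
from collections import Counter

def describe_split(X_train, y_train, X_test, y_test):
    """Return train/test token counts and the training label distribution."""
    n_train_tokens = sum(len(seq) for seq in X_train)
    n_test_tokens = sum(len(seq) for seq in X_test)
    label_counts = Counter(lab for seq in y_train for lab in seq)
    return n_train_tokens, n_test_tokens, label_counts

# Tiny made-up example: 2 training sequences (5 tokens), 1 test sequence (4 tokens)
X_train = [[{'bias': 1.0}] * 3, [{'bias': 1.0}] * 2]
y_train = [['O', 'B-x', 'I-x'], ['O', 'O']]
X_test = [[{'bias': 1.0}] * 4]
y_test = [['O'] * 4]

print(describe_split(X_train, y_train, X_test, y_test))
```

If the test set turns out much larger than the training set, as happened here, that imbalance shows up immediately.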

Unfortunately, my model does not learn anything:

                           precision    recall  f1-score   support   
B-data-c-a-s_message-type     0.0000    0.0000    0.0000        23  
I-data-c-a-s_message-type     0.0000    0.0000    0.0000        90
                micro avg     0.0000    0.0000    0.0000       113
                macro avg     0.0000    0.0000    0.0000       113
             weighted avg     0.0000    0.0000    0.0000       113

【Question Comments】:

    Tags: python scikit-learn crfsuite


    【Solution 1】:

    The problem is solved. As you can see, the support (the number of evaluation samples) totals 113, but my training set only contained about 14 samples, which is far too small! I simply hadn't noticed the discrepancy. I swapped the training and test datasets, and now the performance looks like this:

                                precision    recall  f1-score   support
    B-data-c-a-s_message-type     0.0000    0.0000    0.0000         0
    I-data-c-a-s_message-type     0.6364    1.0000    0.7778        14
                    micro avg     0.6364    1.0000    0.7778        14
                    macro avg     0.3182    0.5000    0.3889        14
                 weighted avg     0.6364    1.0000    0.7778        14
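    Rather than hand-swapping two fixed sets, a more robust fix is to split the labeled sequences with scikit-learn's `train_test_split`, so most of the data lands in training. A sketch with placeholder data (`sentences_X` and `sentences_y` stand in for the real feature and label sequences):

    ```python
    from sklearn.model_selection import train_test_split

    # Placeholder sequences standing in for the real annotated data
    sentences_X = [[{'bias': 1.0}] for _ in range(10)]
    sentences_y = [['O'] for _ in range(10)]

    # Hold out 20% for evaluation; fix random_state for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(
        sentences_X, sentences_y, test_size=0.2, random_state=42
    )
    ```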
    

    【Comments】:
