scikit-learn logistic regression feature importance

【Question title】: scikit-learn logistic regression feature importance
【Posted】: 2018-09-23 14:12:07
【Question】:

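I am looking for a way to understand the influence of the features I use in a classification problem. Using sklearn's logistic regression classifier (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), I understand that the .coef_ attribute gives me the information I am after (also discussed in this thread: How to find the importance of the features for a logistic regression model?).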
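As a rough illustration of reading .coef_, the sketch below fits LogisticRegression on a small synthetic data set and pairs each coefficient with a column name. The data and the names f0 to f3 are made up for illustration and are not the data from this question. Keep in mind that raw coefficient magnitudes are only comparable across features when the features are on similar scales.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 4 numeric features, binary label driven mostly by f0.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)
feature_names = ["f0", "f1", "f2", "f3"]  # made-up names

clf = LogisticRegression().fit(X, y)

# For a binary problem coef_ has shape (1, n_features): one weight per column of X.
coef_by_name = dict(zip(feature_names, clf.coef_[0]))

# Print the features ordered by absolute coefficient, largest first.
for name, weight in sorted(coef_by_name.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {weight:+.3f}")
```

Because coef_ holds exactly one entry per column passed to fit, a mismatch between the number of coefficients and the number of expected features points at the feature matrix rather than at the classifier.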

The first few rows of my matrix:

phrase_type,type,complex_np,np_form,referentiality,grammatical_role,ambiguity,anaphor_type,dir_speech,length_of_span,length_of_coref_chain,position_in_coref_chain,position_in_sentence,is_topic
np,anaphoric,no,defnp,referring,sbj,not_ambig,anaphor_nominal,text_level,2,1,-1,18,True
np,anaphoric,no,defnp,referring,sbj,not_ambig,anaphor_nominal,text_level,2,2,1,1,True
np,none,no,defnp,discourse-new,sbj,not_ambig,_unspecified_,text_level,2,1,-1,9,True

The first row is the header, followed by the data (which my code converts to integers using LabelEncoder from sklearn's preprocessing module).
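For readers unfamiliar with LabelEncoder, here is a minimal sketch of what that conversion does to a single column. The sample values are a handful of grammatical_role entries chosen for illustration. Note that the resulting integer codes carry an arbitrary ordering that a linear model will treat as numeric, which is one reason one-hot encoding is often used for categorical features instead.

```python
from sklearn.preprocessing import LabelEncoder

# A few sample values in the style of the grammatical_role column.
column = ["sbj", "sbj", "dir-obj", "other", "sbj"]

encoder = LabelEncoder()
codes = encoder.fit_transform(column)

# classes_ holds the distinct strings in sorted order; each code is an index into it.
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original strings.
restored = list(encoder.inverse_transform(codes))
print(restored)
```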

Now, when I do

print(classifier.coef_)

I get

[[ 0.84768459 -0.56344453  0.00365928  0.21441586 -1.70290447 -0.18460676
   1.6167634   0.08556331  0.02152226 -0.05111953  0.07310608 -0.073653  ]]

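which contains 12 columns/elements. This confuses me, because my data contains 13 columns (plus a 14th column with the label; I separate the features from the labels later in my code). I was wondering whether sklearn expects/assumes the first column to be an id and does not actually use the values of that column? But I cannot find any information on this.

Any help here would be much appreciated!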

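【Comments】:

The documentation says coef_ should have shape (1, n_features) when the given problem is binary, so it looks like something is off. Could you post some code so we can take a look?
Please provide a Minimal, Complete, and Verifiable example.
Please print X_train.shape for the X_train you pass to the classifier's fit method. It looks like you accidentally dropped a useful column.
Thanks @Alexey, that pointed me in the right direction. It would be great if you could take a quick look at the post below and confirm my understanding!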

Tags: python scikit-learn logistic-regression

【Answer 1】:

I am not sure how to edit my original question so that it still makes sense for future reference, so I will post a minimal example here:

import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score
from collections import defaultdict
import numpy

headers = ['phrase_type','type','complex_np','np_form','referentiality','grammatical_role','ambiguity','anaphor_type','dir_speech','length_of_span','length_of_coref_chain','position_in_coref_chain','position_in_sentence','is_topic']
matrix = [
['np','none','no','no,pds','referring','dir-obj','not_ambig','_unspecified_','text_level','1','1','-1','1','True'],
['np','none','no','pds','not_specified','sbj','not_ambig','_unspecified_','text_level','1','1','-1','21','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','8','1','-1','1','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','8','2','0','6','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','6','2','0','4','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','21','1','-1','1','True'],
['np','anaphoric','no','ne','referring','other','not_ambig','anaphor_nominal','text_level','1','9','4','2','True'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','3','9','5','1','True'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','2','9','7','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','2','1','1','True'],
['np','anaphoric','no','ne','referring','sbj','not_ambig','anaphor_nominal','text_level','2','3','2','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','9','1','13','False'],
['np','none','no','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','2','3','0','5','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','6','1','-1','1','False'],
['np','none','no','ne','discourse-new','sbj','not_ambig','_unspecified_','text_level','2','9','0','1','False'],
['np','none','yes','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','5','1','-1','5','False'],
['np','anaphoric','no','defnp','referring','sbj','not_ambig','anaphor_nominal','text_level','2','3','1','5','False'],
['np','none','no','defnp','discourse-new','sbj','not_ambig','_unspecified_','text_level','3','3','0','1','True'],
['np','anaphoric','no','pper','referring','sbj','not_ambig','anaphor_nominal','text_level','1','3','1','1','True'],
['np','anaphoric','no','pds','referring','sbj','not_ambig','anaphor_nominal','text_level','1','1','-1','2','True']
]


df = pandas.DataFrame(matrix, columns=headers)
d = defaultdict(LabelEncoder)
fit = df.apply(lambda x: d[x.name].fit_transform(x))
df = df.apply(lambda x: d[x.name].transform(x))

testrows = []
trainrows = []
splitIndex = len(matrix) // 10  # integer division, so the value can also be used as a slice index
for index, row in df.iterrows():
    if index < splitIndex:
        testrows.append(row)
    else:
        trainrows.append(row)
testdf = pandas.DataFrame(testrows)
traindf = pandas.DataFrame(trainrows)
train_labels = traindf.is_topic
labels = list(set(train_labels))
train_labels = numpy.array([labels.index(x) for x in train_labels])
train_features = traindf.iloc[:,0:len(headers)-1]
train_features = numpy.array(train_features)
print('train features shape:', train_features.shape)
test_labels = testdf.is_topic
labels = list(set(test_labels))
test_labels = numpy.array([labels.index(x) for x in test_labels])
test_features = testdf.iloc[:,0:len(headers)-1]
test_features = numpy.array(test_features)

classifier = LogisticRegression()
classifier.fit(train_features, train_labels)
print(classifier.coef_)
results = classifier.predict(test_features)
f1 = f1_score(test_labels, results)
print(f1)

I think I may have found the source of the error (thanks to @Alexey Trofimov for pointing me in the right direction). My code originally contained:

train_features = traindf.iloc[:,1:len(headers)-1]

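This was copied from another script in which the first column of my matrix really was an id, which I did not want to take into account. If I understand correctly, the len(headers)-1 is there to leave out the actual label. Testing this in a real-world scenario, removing the -1 yields a perfect F-score, which makes sense: the classifier then sees the actual label among its features and always predicts correctly. So I have now changed it to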

train_features = traindf.iloc[:,0:len(headers)-1]

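as in the code snippet, and I now get 13 columns (in X_train.shape, and consequently in classifier.coef_). I think this solves my problem, but I am still not 100% convinced, so if anyone can point out a flaw in this line of reasoning or in my code above, I would be grateful to hear it.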
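The effect of the slice start can be checked in isolation. The sketch below uses a made-up frame with 13 random integer feature columns (hypothetical names feat_0 to feat_12) and a random is_topic label, not the data above; it only demonstrates how the iloc bounds determine the shape of coef_.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical frame: 13 integer feature columns plus a label column in last place.
rng = np.random.RandomState(0)
headers = [f"feat_{i}" for i in range(13)] + ["is_topic"]
df = pd.DataFrame(rng.randint(0, 5, size=(40, 14)), columns=headers)
df["is_topic"] = rng.randint(0, 2, size=40)  # random binary label

y = df["is_topic"].to_numpy()

# Starting the slice at 1 skips the first column: 12 features remain.
X_skip_first = df.iloc[:, 1:len(headers) - 1].to_numpy()
# Starting at 0 keeps all 13 features; the -1 still excludes the label column.
X_all = df.iloc[:, 0:len(headers) - 1].to_numpy()

shape_skip = LogisticRegression().fit(X_skip_first, y).coef_.shape
shape_all = LogisticRegression().fit(X_all, y).coef_.shape
print(X_skip_first.shape, shape_skip)
print(X_all.shape, shape_all)
```

In both cases the second dimension of coef_ equals the number of columns handed to fit, which is consistent with the reasoning above: the missing coefficient came from the slice, not from sklearn ignoring a column.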

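【Comments】:

Is is_topic your label? If so, more conventional code would look something like: y = 'is_topic' X = df.drop(['is_topic'], axis=1).columns. You then refer to your labels with df[y] and to your features with df[X].
Yes, that is indeed the label. OK, thanks for the tip! I will use that form in future attempts.
No problem, just trying to help :) May I ask why you do the train/test split this way? Usually there is a random element in the split, and most people just use sklearn's train_test_split. If you do not want to use sklearn, it is easy to replicate. If you have to stick with your approach, may I suggest something like testrows=df.iloc[:splitIndex] and trainrows=df.iloc[splitIndex:] to avoid looping through your dataframe?
Thanks for another useful tip :). The reason is that I wrap this code in an x-fold cross-validation loop, where in each run I want a different part of the matrix to serve as the test set (as opposed to x runs that each use a random test set). Looping over the dataframe is indeed a bit inefficient, but so far it has not been a real problem in terms of execution time. Up to now I have mostly run it on at most 5k data instances.
Well, you seem to know what you are doing :) I would probably look at cross_val_score, since you are basically just doing 10-fold cross-validation without shuffling before choosing the folds. By default shuffle is off, so the dataset stays ordered. To get roc_auc, you could do something like cross_val_score(classifier, df[X], df[y], scoring='roc_auc', cv=StratifiedKFold(n_splits=10, shuffle=False)).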
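Putting the suggestions from these comments together, a minimal sketch might look like the following. The data frame is synthetic (three made-up feature columns a, b, c and a binary is_topic label), so the printed score says nothing about the original data; the point is only the shape of the calls.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical data: 3 numeric features and a binary label named is_topic.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df["is_topic"] = (df["a"] + 0.5 * rng.normal(size=100) > 0).astype(int)

y = "is_topic"
X = df.drop([y], axis=1).columns  # every column except the label

# Contiguous split without a Python loop; the slice index must be an integer.
split_index = len(df) // 10
test_df, train_df = df.iloc[:split_index], df.iloc[split_index:]

# shuffle=False means no random permutation is applied before the folds are formed.
cv = StratifiedKFold(n_splits=10, shuffle=False)
scores = cross_val_score(LogisticRegression(), df[X], df[y], scoring="roc_auc", cv=cv)
print(scores.mean())
```

One design note: StratifiedKFold balances the class proportions in each fold, so its folds are not strictly contiguous blocks of rows. If contiguous blocks matter, as in the hand-written loop, plain KFold(n_splits=10) without shuffling is probably the closer match.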