Feature Selection: Resources and Summary

https://stackoverflow.com/questions/49345578/how-to-decide-threshold-value-in-selectfrommodel-for-selecting-features
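Notes:
1) Starting from the code in that post, which ranks and prints feature importances from a random forest, adapt it to rank and print feature importances from an SVM.
The post first trains a random forest classifier, then sorts the features by the trained model's importances and prints them.
The original code: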

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Feature names: every column after the first
feat_labels = data.columns[1:]
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Train the classifier
clf.fit(X_train, y_train)

# Sort feature indices by importance, largest first
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))

Result: [figure: output of the code above]

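Using the same approach to rank features with an SVM
Problems encountered and how they were solved:
a: The RF code above prints feature importances through feature_importances_, but SVC has no such attribute.
b: I looked through the SVC parameters and attributes (blog: https://blog.csdn.net/The_lastest/article/details/78637660, official SVC documentation: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) for something that expresses feature importance. The documentation says coef_ holds the weights assigned to the features; it is only available with a linear kernel.
c: Printing with the code above then failed with an index-out-of-range error. Printing the shapes showed that coef_ is (1, 55) while feature_importances_ is (55,). Reshaping with reshape(55,) resolves the mismatch, after which the tree-based code above can be reused. (Lesson: when something breaks, debug by printing intermediate values.)
SVM-based feature ranking code: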

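import numpy as np
from sklearn.svm import SVC

feat_labels = data.columns[0:55]  # column labels 0 through 54
# print(feat_labels)
clf = SVC(kernel="linear")  # coef_ is only defined for a linear kernel
# Train the classifier
clf.fit(x_lcx, y_lcx)

importances = clf.coef_  # an attribute, not a method: writing clf.coef_() raises an error
importances = importances.reshape(55,)  # (1, 55) -> (55,)
# print(type(clf.coef_))  # <class 'numpy.ndarray'>
# Note: coef_ is signed, so this order runs from the most positive to the most negative
# weight; sort np.abs(importances) instead to rank by magnitude.
indices = np.argsort(importances)[::-1]  # indices sorted by weight, descending; also a numpy.ndarray
print(importances.shape)
print(indices.shape)
for f in range(x_lcx.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))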
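To check the shape issue from c) without the real data, here is a minimal self-contained sketch. The synthetic dataset, the feature names f0..f7 and the scaling step are my own assumptions, not part of the original experiment. It ranks by the absolute value of coef_, since a large negative weight matters as much as a large positive one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption: 200 samples, 8 features).
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
feat_labels = np.array(["f%d" % i for i in range(X.shape[1])])  # hypothetical names

# Scale first so that weight magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)

clf = SVC(kernel="linear").fit(X_scaled, y)
print(clf.coef_.shape)  # (1, 8) for a binary problem

# Flatten to 1-D and rank by absolute weight, largest first.
importances = np.abs(clf.coef_).ravel()
indices = np.argsort(importances)[::-1]

for rank, idx in enumerate(indices, start=1):
    print("%2d) %-10s %f" % (rank, feat_labels[idx], importances[idx]))
```

Using ravel() instead of reshape(55,) avoids hard-coding the number of features. For a multi-class problem coef_ has one row per pair of classes, so the rows would have to be combined first (for example by taking a norm per column).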

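2) More thought is needed on how to imitate and apply this. Building on 1), once features have been ranked by importance with different estimators, how should SelectFromModel be used to filter them, and how should the importance threshold be set?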

from sklearn.feature_selection import SelectFromModel

# Create a selector object that will use the random forest classifier to identify
# features that have an importance of more than 0.15
sfm = SelectFromModel(clf, threshold=0.15)  # clf is the classifier defined in the previous part; 0.15 is the importance threshold
# Train the selector
sfm.fit(X_train, y_train)
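A small sketch of what can be inspected after fitting the selector, on synthetic data (the dataset and its sizes are assumptions). Random forest importances sum to 1, so with n features the average importance is 1/n; a fixed value such as 0.15 is therefore strict when there are many features. SelectFromModel also accepts the strings "mean" and "median" as thresholds, which is what this sketch uses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for X_train / y_train (assumption: 300 samples, 10 features).
X_train, y_train = make_classification(n_samples=300, n_features=10,
                                       n_informative=3, n_redundant=0,
                                       random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
sfm = SelectFromModel(clf, threshold="mean")  # keep features at or above the mean importance
sfm.fit(X_train, y_train)

mask = sfm.get_support()  # boolean mask, True for kept features
print("numeric threshold:", sfm.threshold_)
print("kept feature indices:", np.flatnonzero(mask))

# Reduce the data to the kept columns only.
X_selected = sfm.transform(X_train)
print(X_selected.shape)
```

get_support() can be used to index the feature names, so the kept features can be printed by name in the same way as the ranking above.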

One approach to consider:
I would try the following approach:

a) start with a low threshold, for example: 1e-4
b) reduce your features using SelectFromModel fit & transform
c) compute metrics (accuracy, etc.) for your estimator (RandomForestClassifier in your case) for selected features
d) increase threshold and repeat all steps starting from point 1.
Using this approach you can estimate what is the best threshold for your particular data and your estimator
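In short: set a small threshold, fit and transform with SelectFromModel to train a model once, evaluate the trained model on a validation set, then repeat with larger thresholds and keep the best one (still to look up: the exact steps for training a model on features selected by SelectFromModel). The next step is to run this experiment with an SVM as the estimator, following those steps.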
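The loop described above can be sketched as follows. This is only an illustration under my own assumptions: synthetic data, a random forest as in the quoted advice, a hand-picked list of thresholds, and 5-fold cross-validation in place of a single validation set. The selector is placed inside a pipeline so that it is refit on each training fold and the held-out fold never influences which features are kept.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption: 300 samples, 20 features, 4 informative).
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)

# Candidate thresholds, low to high. With 20 features the largest importance
# is at least 1/20 = 0.05, so every threshold here keeps at least one feature.
thresholds = [1e-4, 0.01, 0.02, 0.03, 0.05]

results = []  # (threshold, mean CV accuracy, features kept on the full data)
for threshold in thresholds:
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold=threshold,
    )
    model = make_pipeline(
        selector, RandomForestClassifier(n_estimators=50, random_state=0)
    )
    mean_score = cross_val_score(model, X, y, cv=5).mean()

    # Fit once on all data only to report how many features this threshold keeps.
    n_kept = int(selector.fit(X, y).get_support().sum())
    results.append((threshold, mean_score, n_kept))
    print("threshold=%.4f  kept=%2d  accuracy=%.3f" % (threshold, n_kept, mean_score))

best = max(results, key=lambda r: r[1])
print("best threshold:", best[0])
```

To repeat this with an SVM, the estimator inside SelectFromModel would be a linear SVC; the thresholds would then apply to the absolute coef_ values, whose scale depends on the data, so the candidate list above would need to be chosen differently.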

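Feature selection methods
2.1 Feature selection methods fall into three broad categories; for details see: https://blog.csdn.net/rui307/article/details/51243796
2.2 Official scikit-learn documentation on feature selection: http://scikit-learn.org/stable/modules/feature_selection.html
2.2.1 Feature selection in scikit-learn, covered in five categories (a translation of the official documentation): https://blog.csdn.net/a1368783069/article/details/52048349
The five categories: 1) removing features with low variance 2) univariate feature selection 3) recursive feature elimination (RFE) 4) SelectFromModel 5) feature selection as part of a pipeline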
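As a quick illustration of categories 1), 2) and 5), the sketch below chains a variance filter and a univariate filter in front of a classifier. The synthetic data, the added constant column and k=5 are assumptions made purely for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: 200 samples, 10 features).
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
# Append a constant column to show what the variance filter removes.
X = np.hstack([X, np.ones((X.shape[0], 1))])

pipe = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),  # 1) drop zero-variance features
    ("univariate", SelectKBest(f_classif, k=5)),     # 2) keep the 5 best by ANOVA F-score
    ("clf", SVC(kernel="linear")),                   # 5) selection runs as part of the pipeline
])
pipe.fit(X, y)

n_after_variance = int(pipe.named_steps["variance"].get_support().sum())
n_after_kbest = int(pipe.named_steps["univariate"].get_support().sum())
print(n_after_variance, n_after_kbest)  # 10 5
```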
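2.2.2 I focused on studying and experimenting with methods 3) and 4)
a: Recursive feature elimination (RFE). Detailed blog post: https://blog.csdn.net/FontThrone/article/details/79004874
Official documentation: http://scikit-learn.org/stable/auto_examples/plot_kernel_approximation.html#sphx-glr-auto-examples-plot-kernel-approximation-py
b: SelectFromModel. Detailed blog post: https://blog.csdn.net/FontThrone/article/details/79064930
Summary: So far I have only run simple experiments with methods a and b, which gave a first look at where the features selected for lcx agree with and differ from those in the paper. Next steps: extend the experiments, using a validation set to choose the number of features for a and the threshold for b. The code needs to be completed into one full program covering the feature extraction experiments and aligned with the paper's overall approach, so as to arrive at a feature selection method that works well on the specific database. The key task is to finish the code as a complete program.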
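For choosing the number of features in method a, one option in scikit-learn is RFECV, which runs RFE under cross-validation and keeps the feature count with the best score. The sketch below is a starting point only: it uses synthetic data and cross-validation rather than the separate validation set mentioned above, and the data sizes are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: 200 samples, 15 features, 4 informative).
X, y = make_classification(n_samples=200, n_features=15, n_informative=4,
                           n_redundant=0, random_state=0)

# A linear SVC exposes coef_, which RFE uses to decide which feature to drop.
rfecv = RFECV(estimator=SVC(kernel="linear"), step=1, cv=5, scoring="accuracy")
rfecv.fit(X, y)

print("selected feature count:", rfecv.n_features_)
print("selected mask:", rfecv.support_)
print("ranking (1 = kept):", rfecv.ranking_)
```

With step=1 one feature is removed per iteration; a larger step is faster on wide data at the cost of a coarser search.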
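A blog with a comprehensive introduction to feature selection algorithms: http://www.cnblogs.com/wymlnn/p/4569437.html
https://blog.csdn.net/fisherming/article/details/79925574 A worked introduction to feature selection and data processing; it cites three good blog posts that are worth studying later
Feature importance computation in tree ensemble algorithms: https://blog.csdn.net/yimingsilence/article/details/71713751
Feature selection based on tree ensembles is one of the ideas still to be verified.
Screening features by importance with xgboost: https://blog.csdn.net/q383700092/article/details/53698760 (one of the ideas still to be verified).
http://dataunion.org/14072.html Practical guide: several common feature selection methods, introduced with scikit-learn
Feature selection algorithms and API usage in the sklearn library: https://blog.csdn.net/cymy001/article/details/79425960 (a fairly comprehensive and recent article)
https://blog.csdn.net/adore1993/article/details/53980327 Summary notes on feature selection algorithms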