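[Posted]: 2021-10-29 23:45:38
[Question]:
I am applying machine learning algorithms with the caret package (caretList) to predict death in a group of patients from several variables (e.g. age, sex, smoking status):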
algorithmList <- c('rf', 'pls','parRF','nnet', 'xgbTree','avNNet',
'gbm','monmlp','nb','glm','pcaNNet','lda','C5.0',
'svmLinear2','knn')
set.seed(100)
list_models <- caretList(Death_event~., data=na.exclude(dataset), methodList = algorithmList, metric="ROC", trControl=control)
I then use varImp to extract the variable importance from each model in this list, which yields a list of lists:
importance <- lapply(list_models, varImp)
Output:
> str(importance)
List of 15
$ rf :List of 3
..$ importance:'data.frame': 11 obs. of 1 variable:
.. ..$ Overall: num [1:11] 53.8 4.1 100 7.44 0 ...
..$ model : chr "rf"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ pls :List of 3
..$ importance:'data.frame': 11 obs. of 1 variable:
.. ..$ Overall: num [1:11] 15.91 4.88 100 18.95 0 ...
..$ model : chr "pls"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ parRF :List of 3
..$ importance:'data.frame': 11 obs. of 1 variable:
.. ..$ Overall: num [1:11] 51.26 3.74 100 7.66 0 ...
..$ model : chr "parRF"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ nnet :List of 3
..$ importance:'data.frame': 11 obs. of 1 variable:
.. ..$ Overall: num [1:11] 14 41.9 56.4 62.1 31.2 ...
..$ model : chr "nnet"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ xgbTree :List of 3
..$ importance:'data.frame': 11 obs. of 1 variable:
.. ..$ Overall: num [1:11] 100 48.1 40.2 21.5 21.1 ...
..$ model : chr "xgbTree"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ avNNet :List of 3
..$ importance:'data.frame': 11 obs. of 2 variables:
.. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
.. ..$ Death : num [1:11] 14.37 14.36 100 45.4 9.04 ...
..$ model : chr "ROC curve"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ gbm :List of 3
..$ importance:'data.frame': 11 obs. of 1 variable:
.. ..$ Overall: num [1:11] 13.543 0.749 100 6.743 0 ...
..$ model : chr "gbm"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ monmlp :List of 3
..$ importance:'data.frame': 11 obs. of 2 variables:
.. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
.. ..$ Death : num [1:11] 14.37 14.36 100 45.4 9.04 ...
..$ model : chr "ROC curve"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ nb :List of 3
..$ importance:'data.frame': 11 obs. of 2 variables:
.. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
.. ..$ Death : num [1:11] 14.37 14.36 100 45.4 9.04 ...
..$ model : chr "ROC curve"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ glm :List of 3
..$ importance:'data.frame': 11 obs. of 1 variable:
.. ..$ Overall: num [1:11] 13 27.3 100 50.5 11.6 ...
..$ model : chr "glm"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ pcaNNet :List of 3
..$ importance:'data.frame': 11 obs. of 2 variables:
.. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
.. ..$ Death : num [1:11] 14.37 14.36 100 45.4 9.04 ...
..$ model : chr "ROC curve"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ lda :List of 3
..$ importance:'data.frame': 11 obs. of 2 variables:
.. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
.. ..$ Death : num [1:11] 14.37 14.36 100 45.4 9.04 ...
..$ model : chr "ROC curve"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ C5.0 :List of 3
..$ importance:'data.frame': 11 obs. of 1 variable:
.. ..$ Overall: num [1:11] 100 100 100 100 100 ...
..$ model : chr "C5.0"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ svmLinear2:List of 3
..$ importance:'data.frame': 11 obs. of 2 variables:
.. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
.. ..$ Death : num [1:11] 14.37 14.36 100 45.4 9.04 ...
..$ model : chr "ROC curve"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ knn :List of 3
..$ importance:'data.frame': 11 obs. of 2 variables:
.. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
.. ..$ Death : num [1:11] 14.37 14.36 100 45.4 9.04 ...
..$ model : chr "ROC curve"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
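This is where I run into my first problem.
For about half of the algorithms, the importance is extracted with a different method (the ROC method). This does not change the interpretation, but for some algorithms the column header is "Importance" while for others it is "Overall", even though it is exactly the same kind of information: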
$gbm
gbm variable importance
Overall
Age_at_CT 100.0000
Muscle_HU 48.6376
history_of_CV_yes_noat_leasT_1CV_event 38.1153
VAT_Area_cm2 19.3376
Liver_HU_Median 17.7983
SAT_Area_cm2 17.3343
L3_SMI_cm2m2 15.5910
BMI 13.5431
Tobacco_yes_noSmoker 6.7431
SexMale 0.7494
T2D_at_CTDiabetes 0.0000
$monmlp
ROC curve variable importance
Importance
Age_at_CT 100.000
Muscle_HU 87.085
history_of_CV_yes_no 61.254
VAT_Area_cm2 49.174
Liver_HU_Median 47.712
Tobacco_yes_no 45.404
BMI 14.372
Sex 14.363
T2D_at_CT 9.035
L3_SMI_cm2m2 7.453
SAT_Area_cm2 0.000
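As you may have noticed in the structure, the algorithms whose importance is extracted with the ROC method have two sub-columns (No_death and Death), but both contain exactly the same numbers.
What I want to create is a simple tibble/data frame in which:
column 1 = the name of the algorithm (here the names of the list, e.g. gbm or monmlp), column 2 = the name of the variable (e.g. Age_at_CT, Muscle_HU, etc.), column 3 = the importance value (labelled "Importance" for some algorithms and "Overall" for others)
The only workaround I have found is to print the list and copy/paste it into an Excel sheet, algorithm by algorithm (yes... that is awful).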
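To make the target layout concrete, here is a minimal base-R sketch. It assumes every element of the list keeps the structure shown in the str() output above: a list with an importance data frame whose row names are the variable names. The helper name tidy_importance and the mock objects are hypothetical, and the mock values are illustrative only, not real results. Taking the first column is also an assumption; it relies on the two class columns of the ROC-based importance being identical, as they are in the two-class output above.

```r
# Hypothetical helper: flatten a named list of varImp-like objects into
# one long data frame with columns model, variable, importance.
tidy_importance <- function(imp_list) {
  rows <- lapply(names(imp_list), function(model_name) {
    imp <- imp_list[[model_name]]$importance
    data.frame(
      model      = model_name,
      variable   = rownames(imp),
      # First column: "Overall" for model-specific importance, or the
      # first class column for ROC-based importance (assumed identical
      # across classes in the two-class case).
      importance = imp[[1]],
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

# Mock objects mimicking the structure printed by str(importance).
# The numbers are made up for illustration.
mock_importance <- list(
  gbm = list(importance = data.frame(
    Overall   = c(100, 0),
    row.names = c("Age_at_CT", "SexMale")
  )),
  knn = list(importance = data.frame(
    No_death  = c(100, 0),
    Death     = c(100, 0),
    row.names = c("Age_at_CT", "Sex")
  ))
)

result <- tidy_importance(mock_importance)
print(result)
```

With the real objects the call would be tidy_importance(importance). Note that the variable names will not line up perfectly across models: as the printed output above shows, model-specific importance uses dummy-coded names (SexMale) while ROC-based importance uses the original names (Sex).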
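[Comments]:
- The control variable passed to caretList is missing; please provide it.
- Also, in knn for example, you have two Importance values instead of one. Which one do you want to include in the resulting data.frame?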