【Question Title】: Defining classes using random forest models in R
【Posted】: 2018-09-19 08:54:21
【Question】:

I'm very new to machine learning, and I've stumbled on a problem that I can't seem to find a solution to no matter how hard I Google.

I ran a multiclass classification using the randomForest algorithm and found a model that predicts my test sample adequately. I then used varImpPlot() to determine which predictors were most important in determining class assignment.

My question: I want to know *why* these predictors are the most important. Specifically, I'd like to be able to report that cases belonging to class X are characterized by A (e.g., being male), B (e.g., being older), and C (e.g., having a high IQ), while cases belonging to class Y are characterized by D (female), E (younger), and F (low IQ), and so on.

For example, I know that standard binary logistic regression lets you say that cases with higher values on feature A are more likely to belong to class X. So I'm hoping for something conceptually similar, but for a multiclass random forest classification model.

Is this something that can be done with a random forest model? If so, is there a function in randomForest or caret (or even elsewhere) that can take me beyond the varImpPlot() / varImp() tables?

Thanks!

【Comments】:

  • What you're looking for is per-class (relative) variable importance; the output of varImpPlot() is overall variable importance.
  • Try checking: stackoverflow.com/questions/29637145/… and stackoverflow.com/questions/47609200/… Please post an update if you find anything, since this is an important topic and answers are hard to find.
  • One possible approximation of per-class relative importance is to build N one-vs-all models, where N is the number of classes to predict. However, I consider this more of a workaround than a truly robust solution to the problem you're facing.
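Both ideas from the comments above can be tried directly in R. When a classification forest is fit with importance = TRUE, importance() already reports one permutation-importance column per class alongside the overall columns, and the one-vs-all workaround is a short loop. A minimal sketch on the built-in iris data (the data set and seed are just for illustration):

```r
library(randomForest)

# Fit with importance = TRUE so per-class permutation importances are stored
set.seed(42)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

# One mean-decrease-in-accuracy column per class, plus
# MeanDecreaseAccuracy and MeanDecreaseGini overall
importance(rf)

# One-vs-all workaround: one binary forest per class
for (cls in levels(iris$Species)) {
  y <- factor(ifelse(iris$Species == cls, cls, "other"))
  rf_bin <- randomForest(iris[, 1:4], y, importance = TRUE)
  cat("Class:", cls, "\n")
  print(importance(rf_bin, type = 1))  # type = 1: permutation importance
}
```

Note that per-class importance still only says *which* features matter for a class, not in *which direction* they push; for direction you need partial-dependence-style tools such as those in the answer below.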

Tags: r machine-learning random-forest


【Solution 1】:

There is a package called ExplainPrediction that promises to explain random forest models. Here is the top of its DESCRIPTION file; the page at the URL field below has a link to an extensive citation list:

Package: ExplainPrediction
Title: Explanation of Predictions for Classification and Regression Models
Version: 1.3.0
Date: 2017-12-27
Author: Marko Robnik-Sikonja
Maintainer: Marko Robnik-Sikonja <marko.robnik@fri.uni-lj.si>
Description: Generates explanations for classification and regression models and visualizes them.
 Explanations are generated for individual predictions as well as for models as a whole. Two explanation methods
 are included, EXPLAIN and IME. The EXPLAIN method is fast but might miss explanations expressed redundantly
 in the model. The IME method is slower as it samples from all feature subsets.
 For the EXPLAIN method see Robnik-Sikonja and Kononenko (2008) <doi:10.1109/TKDE.2007.190734>, 
 and the IME method is described in Strumbelj and Kononenko (2010, JMLR, vol. 11:1-18).
 All models in package 'CORElearn' are natively supported, for other prediction models a wrapper function is provided 
 and illustrated for models from packages 'randomForest', 'nnet', and 'e1071'.
License: GPL-3
URL: http://lkm.fri.uni-lj.si/rmarko/software/
Imports: CORElearn (>= 1.52.0),semiArtificial (>= 2.2.5)
Suggests: nnet,e1071,randomForest

And also:

Package: DALEX
Title: Descriptive mAchine Learning EXplanations
Version: 0.1.1
Authors@R: person("Przemyslaw", "Biecek", email = "przemyslaw.biecek@gmail.com", role = c("aut", "cre"))
Description: Machine Learning (ML) models are widely used and have various applications in classification 
  or regression. Models created with boosting, bagging, stacking or similar techniques are often
  used due to their high performance, but such black-box models usually lack of interpretability.
  'DALEX' package contains various explainers that help to understand the link between input variables and model output.
  The single_variable() explainer extracts conditional response of a model as a function of a single selected variable.
  It is a wrapper over packages 'pdp' and 'ALEPlot'.
  The single_prediction() explainer attributes parts of model prediction to particular variables used in the model.
  It is a wrapper over 'breakDown' package.
  The variable_dropout() explainer assess variable importance based on consecutive permutations.
  All these explainers can be plotted with generic plot() function and compared across different models.
Depends: R (>= 3.0)
License: GPL
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1.9000
Imports: pdp, ggplot2, ALEPlot, breakDown
Suggests: gbm, randomForest, xgboost
URL: https://pbiecek.github.io/DALEX/
BugReports: https://github.com/pbiecek/DALEX/issues
NeedsCompilation: no
Packaged: 2018-02-28 01:44:36 UTC; pbiecek
Author: Przemyslaw Biecek [aut, cre]
Maintainer: Przemyslaw Biecek <przemyslaw.biecek@gmail.com>
Repository: CRAN
Date/Publication: 2018-02-28 16:36:14 UTC
Built: R 3.4.3; ; 2018-04-03 03:04:04 UTC; unix
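A hedged sketch of pointing the explainers named in this DESCRIPTION at a multiclass randomForest model. The function names come from the version 0.1.1 DESCRIPTION above; later DALEX releases renamed variable_dropout() (to variable_importance() and then model_parts()), so check the documentation of your installed version.

```r
library(DALEX)
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris)

# Build one explainer per class: give explain() a predict function that
# returns the probability of that class ("setosa" here, as an example)
explainer <- explain(rf,
                     data = iris[, 1:4],
                     y = as.numeric(iris$Species == "setosa"),
                     predict_function = function(m, x)
                       predict(m, x, type = "prob")[, "setosa"],
                     label = "rf: setosa vs rest")

# variable_dropout() is the permutation-based importance explainer
# named in the DESCRIPTION; plot() visualizes it
vd <- variable_dropout(explainer)
plot(vd)
```

Repeating this per class gives the kind of per-class profile the question asks for, with single_variable() then showing the direction of each feature's effect.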

【Comments】:
