Handling imbalanced data in a multi-class classification problem: answers

【Question title】: Handling imbalanced data in multi-class classification problem
【Posted】: 2019-07-13 17:34:06
【Question description】:

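I have a multi-class classification problem with heavily skewed data. My target variable (y) has 3 classes, with the following shares of the data:
- 0 = 3%
- 1 = 90%
- 2 = 7%

I am looking for an R package that can do multi-class oversampling, undersampling, or both.

If this is not feasible in R, where else can I handle this problem?

PS: I tried the ROSE package in R, but it only works for binary class problems.

【Question discussion】:

Tags: python r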

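【Solution 1】:

The caret package provides a wide range of ML algorithms, including ones for multi-class problems.

It can also apply down-sampling and up-sampling via downSample() and upSample():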

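library(caret)

# downSample()/upSample() expect the class labels as a factor, so build one explicitly
trainclass <- data.frame("label" = factor(c(rep("class1", 100), rep("class2", 20), rep("class3", 180))),
                         "predictor1" = rnorm(300, 0 ,1),
                         "predictor2" = sample(c("this", "that"), 300, replace = TRUE))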

> table(trainclass$label)
class1 class2 class3 
   100     20    180 

#then use
set.seed(234)
dtrain <- downSample(x = trainclass[, -1],
                     y = trainclass$label)

> table(dtrain$Class)
class1 class2 class3 
    20     20     20
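
To make clear what down-sampling and up-sampling do to the class counts, here is a minimal base-R sketch. The helper `resample_to()` and the `toy` data frame are hypothetical names made up for illustration; this is not how caret implements downSample()/upSample(), only the same idea written out by hand.

```r
# Hypothetical helper (not part of caret): bring every class to n_per_class rows.
# Larger classes are cut down without replacement; smaller classes keep all
# their rows and get extra rows drawn with replacement.
resample_to <- function(df, label_col, n_per_class) {
  idx_by_class <- split(seq_len(nrow(df)), df[[label_col]])
  picked <- lapply(idx_by_class, function(idx) {
    if (length(idx) >= n_per_class) {
      idx[sample.int(length(idx), n_per_class)]
    } else {
      extra <- idx[sample.int(length(idx), n_per_class - length(idx), replace = TRUE)]
      c(idx, extra)
    }
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

set.seed(234)
toy <- data.frame(
  label = factor(c(rep("class1", 100), rep("class2", 20), rep("class3", 180))),
  predictor1 = rnorm(300)
)

counts <- table(toy$label)
down <- resample_to(toy, "label", min(counts))  # every class cut to the smallest class size
up   <- resample_to(toy, "label", max(counts))  # every class grown to the largest class size

table(down$label)
table(up$label)
```

Down-sampling throws away majority rows, while up-sampling repeats minority rows, so the first loses information and the second can encourage overfitting to the duplicated cases.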

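A nice feature: caret can also perform down-sampling, up-sampling, SMOTE and ROSE inside a resampling procedure such as cross-validation.

The following runs 10-fold cross-validation with down-sampling.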

ctrl <- caret::trainControl(method = "cv",
                   number = 10,
                   verboseIter = FALSE,
                   summaryFunction = caret::multiClassSummary,
                   sampling = "down")

set.seed(42)
# `data` stands for your training data frame; its outcome column `Class` must be a factor
model_rf_under <- caret::train(Class ~ ., 
                               data = data,
                               method = "rf",
                               trControl = ctrl)
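
The point of sampling inside the resampling loop is that only the training part of each fold is rebalanced, while the held-out part keeps the original class distribution. The base-R sketch below illustrates that idea by hand; it is not caret's internal code, the fold assignment is a simple random (non-stratified) split, and `fold_summary` is a made-up name.

```r
set.seed(42)
y <- factor(c(rep("class1", 100), rep("class2", 20), rep("class3", 180)))

k <- 10
# Assign each row to one of k folds at random (not stratified, for brevity).
fold <- sample(rep(seq_len(k), length.out = length(y)))

fold_summary <- lapply(seq_len(k), function(i) {
  train_idx <- which(fold != i)
  test_idx  <- which(fold == i)

  # Down-sample the training rows only: keep n_min rows from every class.
  by_class <- split(train_idx, y[train_idx])
  n_min <- min(lengths(by_class))
  train_down <- unlist(
    lapply(by_class, function(idx) idx[sample.int(length(idx), n_min)]),
    use.names = FALSE
  )

  # A model would be fitted on train_down and scored on test_idx here.
  list(train_counts = table(y[train_down]),
       test_n = length(test_idx),
       overlap = length(intersect(train_down, test_idx)))
})

fold_summary[[1]]$train_counts  # balanced training counts for the first fold
```

Because the held-out rows are never resampled, the performance estimate reflects the imbalanced distribution the model will actually face.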

See here for more information: https://topepo.github.io/caret/subsampling-for-class-imbalances.html

Also have a look at the mlr package: https://mlr.mlr-org.com/articles/tutorial/over_and_undersampling.html#sampling-based-approaches

【Discussion】:

【Solution 2】:

You can use the SMOTE function from the DMwR package. I created a sample data set with three imbalanced classes.

install.packages("DMwR")
library(DMwR)

## A small example with a data set created artificially from the IRIS
## data 
data(iris)

#setosa 90%, versicolor 3% and virginica 7%
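# SMOTE needs the target as a factor; since R 4.0 a character vector is no longer converted automatically
Species<-factor(c(rep("setosa",135),rep("versicolor",5),rep("virginica",10)))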
data<-cbind(iris[,1:4],Species)
table(data$Species)

Imbalanced classes:

setosa versicolor  virginica 
  135       5         10

Now, to rebalance the 2 minority classes, apply the SMOTE function to the data twice:

First_Imbalence_recover <- DMwR::SMOTE(Species ~ ., data, perc.over = 2000,perc.under=100)

Final_Imbalence_recover <- DMwR::SMOTE(Species ~ ., First_Imbalence_recover, perc.over = 2000,perc.under=200)
table(Final_Imbalence_recover$Species)

Final balanced classes:

setosa versicolor  virginica 
    79         81         84
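
The final counts can be checked with a little arithmetic. The sketch below assumes the semantics described in the DMwR documentation: perc.over/100 synthetic cases are generated per minority case, and perc.under/100 cases from the other classes are sampled per synthetic case. `smote_sizes()` is a hypothetical helper, not a DMwR function, and the minority count of 4 in the second call is inferred from the output above (84 = 4 + 80); it is not stated in the answer.

```r
# Hypothetical helper (not part of DMwR): expected class sizes after one SMOTE call,
# under the assumed perc.over / perc.under semantics described above.
smote_sizes <- function(n_minority, perc.over, perc.under) {
  n_synthetic <- n_minority * perc.over / 100
  c(minority = n_minority + n_synthetic,           # original + synthetic minority cases
    majority = n_synthetic * perc.under / 100)     # cases sampled from all other classes
}

first  <- smote_sizes(5, perc.over = 2000, perc.under = 100)  # versicolor is the minority
second <- smote_sizes(4, perc.over = 2000, perc.under = 200)  # virginica, assuming 4 rows survived

first
second
```

Under these assumptions the second call yields 84 virginica rows and 160 rows from the other two classes, which matches 79 + 81 = 160 in the table above.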

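Note: the new minority-class examples are generated using information from the k nearest neighbours of each minority-class example. The parameter k controls how many of these neighbours are used. The class counts may therefore differ from run to run, but this should not affect the overall balance.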
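
The neighbour-based generation described in the note can be sketched in a few lines of base R. This is only an illustration of the SMOTE idea for numeric columns, not the DMwR implementation; `smote_one_each()` is a hypothetical helper that creates one synthetic point per minority row.

```r
# Hypothetical helper: one synthetic point per minority row, numeric columns only.
smote_one_each <- function(X, k = 5) {
  X <- as.matrix(X)
  d <- as.matrix(dist(X))   # pairwise Euclidean distances
  diag(d) <- Inf            # a point is not its own neighbour
  t(sapply(seq_len(nrow(X)), function(i) {
    nn  <- order(d[i, ])[seq_len(k)]    # indices of the k nearest neighbours of row i
    j   <- nn[sample.int(k, 1)]         # pick one neighbour at random
    gap <- runif(1)                     # random position along the segment
    X[i, ] + gap * (X[j, ] - X[i, ])    # point between row i and its neighbour j
  }))
}

set.seed(1)
minority  <- iris[iris$Species == "versicolor", 1:4][1:10, ]
synthetic <- smote_one_each(minority, k = 3)
dim(synthetic)
```

Each synthetic row lies on the line segment between a real minority row and one of its neighbours, which is why the random draws change the exact result from run to run.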

【Discussion】:

Hi Sahidul, thanks for your answer. One related question, though: this should also be possible with other packages. But is it technically correct?
Hi Sahidul, I tried this with my database, but it doesn't work. When I run the code, it throws an error: Error in matrix(if (is.null(value)) logical() else value, nrow = nr, dimnames = list(rn, : length of 'dimnames' [2] not equal to array extent
@Md. Sahidul Islam When I run this on my data, I get this error: the condition has length > 1 and only the first element will be used; the condition has length > 1 and only the first element will be used; Error in T[, col] <- data[, col] : incorrect number of subscripts on matrix. Can anyone help?