Splitting a data frame into training and testing sets in R: answers

【Question title】: Splitting a data frame into training and testing sets in R
【Posted】: 2023-12-25 13:20:01
【Question】:

I have the following data.frame:

>str(customerduration_data)

Classes 'tbl_df', 'tbl' and 'data.frame':   4495 obs. of  4 variables:

$ monthofgateOUT    : Ord.factor w/ 4 levels "8"<"9"<"10"<"11": 1 1 1 1 1 1 1 1 1 1 ...

$ dayofgateOUT      : Ord.factor w/ 7 levels "Monday"<"Tuesday"<..: 4 5 1 1 1 1 1 2 2 3 ...

$ timeofgateOUT     : Ord.factor w/ 20 levels "3"<"4"<"5"<"6"<..: 13 4 2 3 3 11 15 10 13 14 ...

$ durationCUST_hours: num  95.63 5.73 10.73 10.2 14.4 .

I want to split this data into a training set and a testing set using the following commands:

install.packages("caTools")

library (caTools)

set.seed(6)

customerduration_data$spl=sample.split(customerduration_data,SplitRatio=0.7)

However, after running the commands above, I get the following error:

>Error in `$<-.data.frame`(`*tmp*`, spl, value = c(TRUE, FALSE, FALSE,  : 
  replacement has 4 rows, data has 4495

How can I solve this problem?

【Comments】:

Please provide code that creates reproducible data.
>install.packages("caTools") >library (caTools) >customerduration_data% select(monthofgateOUT, dayofgateOUT, timeofgateOUT, durationCUST_hours) %>% mutate(durationCUST_hours=as.numeric (durationCUST_hours) ) > set.seed(6) > customerduration_data$spl=sample.split(customerduration_data,SplitRatio=0.7)
Please add that sample.split comes from library(caTools). From the function's help: "Split data from vector Y into two sets in predefined ratio while preserving relative ratios of different labels in Y." You gave it a data frame, hence the error.
I did add library(caTools) before the code that splits the data...
If you are trying to split the data frame into two random chunks holding 30% and 70% of the data, I would use base R: df$spl <- sample(c(rep("Test", floor(0.7*4495)), rep("Train", 4495-floor(0.7*4495))), replace = F)

Tags: r syntax-error training-data

【Solution 1】:

As an alternative, you can use base R, which makes the selection faster (3.4 times faster according to microbenchmark) and requires no additional package:

df$spl <- sample(c(rep(TRUE, floor(0.7*4495)), rep(FALSE, 4495-floor(0.7*4495))), replace = F)

To split it into data sets:

df$spl <- sample(c(rep(TRUE, floor(0.7*4495)), rep(FALSE, 4495-floor(0.7*4495))), replace = F)
# TRUE marks the 70% share, which is used as the training set here
train_data <- df[df[,'spl'] %in% TRUE, ]
test_data  <- df[df[,'spl'] %in% FALSE, ]
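The snippet above hard-codes the row count 4495. A minimal sketch of the same idea that derives the sizes from nrow() and samples row indices directly might look as follows. The data frame df and its columns are hypothetical stand-ins for the asker's data, not taken from the original post.

```r
set.seed(6)  # make the split reproducible

# Hypothetical stand-in for customerduration_data (4495 rows)
df <- data.frame(x = rnorm(4495), y = runif(4495))

n       <- nrow(df)
n_train <- floor(0.7 * n)        # size of the training share

# sample(n, size) draws `size` distinct values from 1:n
train_idx  <- sample(n, n_train)
train_data <- df[train_idx, ]
test_data  <- df[-train_idx, ]   # every row not drawn for training

nrow(train_data)
nrow(test_data)
```

Because the indices are drawn without replacement, the two sets are disjoint and together cover every row.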

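【Comments】:

My tests show an even larger speedup, 4.6 times. (Windows 7/R 3.5.0).
You don't need to compare with TRUE; df[,'spl'] alone is enough, and !df[,'spl'] gives the FALSE part.

【Solution 2】:

The function sample.split expects a vector. Here is a simple way to give it one: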

library(caTools)
customerduration_data$spl <- sample.split(seq_len(nrow(customerduration_data)), 
                                          SplitRatio = 0.7)
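The "replacement has 4 rows" message in the question is consistent with the fact that length() of a data frame is its number of columns, not rows. The base R sketch below illustrates only that point, using a small hypothetical data frame; it does not call caTools.

```r
# Hypothetical 4-column data frame standing in for customerduration_data
df <- data.frame(a = 1:10,
                 b = letters[1:10],
                 c = runif(10),
                 d = rnorm(10))

length(df)                  # a data frame is a list of columns, so this is 4
nrow(df)                    # number of observations: 10
length(seq_len(nrow(df)))   # one element per row, which is what a row mask needs
```

Passing seq_len(nrow(...)) therefore yields one logical value per row, which can be assigned as a column.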

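【Comments】:

Thanks! When I tried the following, I got another error: > vars str(customerduration_data[, c(vars, "durationCUST_hours")]) > train test tree_mod
@FleurLolkema You should post this as a separate question. Looking at the str() output above, your outcome appears to be numeric. Typically, one would create a new column of type factor with a reasonable number of levels, where "reasonable" requires domain knowledge.

【Solution 3】:

You are creating an index column in the original data.frame. If you want to split df into two sets, train and test, you can do the following.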

library(caTools)

set.seed(6)    # make the results reproducible

inx <- sample.split(seq_len(nrow(customerduration_data)), 0.7)
train <- customerduration_data[inx, ]
test <-  customerduration_data[!inx, ]

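This does not create the column spl. To create it, use @RalfStubner's answer.

Edit.

Another way is to use sample with a probability vector.

inx2 <- sample(c(FALSE, TRUE), 4495, replace = TRUE, prob = c(0.3, 0.7))

Having tested the three solutions so far, I got the following results.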

microbenchmark::microbenchmark(
  base_griffinevo = sample(c(rep(TRUE, floor(0.7*4495)), rep(FALSE, 4495-floor(0.7*4495))), replace = F),
  base_Rui = sample(c(FALSE, TRUE), 4495, replace = TRUE, prob = c(0.3, 0.7)),
  caTools_Ralf = sample.split(seq_len(nrow(customerduration_data)), 0.7)
)
#Unit: microseconds
#            expr     min       lq      mean  median        uq      max neval
# base_griffinevo 177.072 183.7665  219.3547 195.147  239.6660  523.851   100
#        base_Rui  89.708  93.2225  119.4083 119.666  134.5615  253.389   100
#    caTools_Ralf 838.495 861.4235 1103.0870 926.361 1313.1390 3634.478   100

So the simpler base R way is also the fastest.
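One caveat when comparing these approaches: sampling with a probability vector draws each row independently, so the share of TRUE values is only approximately 70%, whereas shuffling a fixed vector of TRUE/FALSE values gives an exact count. A small sketch of the difference (the variable names exact and approx are illustrative):

```r
set.seed(6)
n <- 4495

# Exact split: shuffle a vector with a fixed number of TRUE values
exact <- sample(c(rep(TRUE, floor(0.7 * n)),
                  rep(FALSE, n - floor(0.7 * n))))

# Approximate split: each row is TRUE with probability 0.7, independently
approx <- sample(c(FALSE, TRUE), n, replace = TRUE, prob = c(0.3, 0.7))

sum(exact)     # always floor(0.7 * n)
sum(approx)    # varies from seed to seed around 0.7 * n
```

Whether the approximate split is acceptable depends on how strictly the 70/30 ratio has to hold.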

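【Comments】:

【Solution 4】:

Here is an alternative that uses the caret package and its createDataPartition() function. We'll use the Alzheimer's disease data from the AppliedPredictiveModeling package to illustrate creating the testing and training data sets.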

library(AppliedPredictiveModeling)
library(caret)
data(AlzheimerDisease)
adData <- data.frame(diagnosis, predictors)
# count rows in data frame
nrow(adData)
trainIndex <- createDataPartition(diagnosis, p = .75,list=FALSE)
training <- adData[trainIndex,]
testing <- adData[-trainIndex,]
# rows in training data frame
nrow(training)
# rows in testing data frame 
nrow(testing)

...and the output:

> library(AppliedPredictiveModeling)
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> data(AlzheimerDisease)
> adData <- data.frame(diagnosis, predictors)
> # count rows in data frame
> nrow(adData)
[1] 333
> trainIndex <- createDataPartition(diagnosis, p = .75,list=FALSE)
> training <- adData[trainIndex,]
> testing <- adData[-trainIndex,]
> # rows in training data frame
> nrow(training)
[1] 251
> # rows in testing data frame 
> nrow(testing)
[1] 82
>
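For a factor outcome, createDataPartition() samples within each class so that class proportions are roughly preserved in both sets. The base R sketch below imitates that idea on hypothetical labels. It is not caret's implementation, and the use of ceiling() for the per-class size is an assumption made for the illustration.

```r
set.seed(6)

# Hypothetical class labels: 80 of class "A", 20 of class "B"
y <- factor(rep(c("A", "B"), times = c(80, 20)))

# Row indices grouped by class
idx_by_class <- split(seq_along(y), y)

# Draw 75% of the indices within each class
# (i[sample.int(...)] avoids the sample() pitfall for length-1 vectors)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), size = ceiling(0.75 * length(i)))]
}))

prop.table(table(y[train_idx]))    # class shares in the training set
prop.table(table(y[-train_idx]))   # class shares in the testing set
```

With these labels, both sets keep the 80/20 class ratio of the full vector.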

【Comments】: