【Posted】: 2023-12-25 13:20:01
【Problem description】:
I have the following data.frame:
>str(customerduration_data)
Classes 'tbl_df', 'tbl' and 'data.frame': 4495 obs. of 4 variables:
$ monthofgateOUT : Ord.factor w/ 4 levels "8"<"9"<"10"<"11": 1 1 1 1 1 1 1 1 1 1 ...
$ dayofgateOUT : Ord.factor w/ 7 levels "Monday"<"Tuesday"<..: 4 5 1 1 1 1 1 2 2 3 ...
$ timeofgateOUT : Ord.factor w/ 20 levels "3"<"4"<"5"<"6"<..: 13 4 2 3 3 11 15 10 13 14 ...
$ durationCUST_hours: num 95.63 5.73 10.73 10.2 14.4 ...
I want to split this data into a training set and a test set using the following commands:
install.packages("caTools")
library (caTools)
set.seed(6)
customerduration_data$spl=sample.split(customerduration_data,SplitRatio=0.7)
However, after running the commands above, I get the following error:
>Error in `$<-.data.frame`(`*tmp*`, spl, value = c(TRUE, FALSE, FALSE, :
replacement has 4 rows, data has 4495
How can I fix this?
【Discussion】:
-
Please provide code that creates reproducible data.
-
>install.packages("caTools") >library (caTools) >customerduration_data% select(monthofgateOUT, dayofgateOUT, timeofgateOUT, durationCUST_hours) %>% mutate(durationCUST_hours=as.numeric (durationCUST_hours) ) > set.seed(6) > customerduration_data$spl=sample.split(customerduration_data,SplitRatio=0.7)
-
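Please add that sample.split comes from library(caTools). The function's help says: "Split data from vector Y into two sets in predefined ratio while preserving relative ratios of different labels in Y." You are passing it a data frame rather than a vector, hence the error.
-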
I did add library(caTools) before the code that splits the data...
-
If you are trying to split the data frame into two random chunks holding 30% and 70% of the data, I would use base R:
# Label 70% of the rows "Train" and the rest "Test", then shuffle the labels
n <- nrow(df); n_train <- floor(0.7 * n)
df$spl <- sample(rep(c("Train", "Test"), times = c(n_train, n - n_train)))
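
The comment about sample.split expecting a vector can be sketched as follows. A data frame is a list of columns, so its length is the number of columns (4 in the question), which is consistent with the "replacement has 4 rows" message. This is a minimal sketch rather than a definitive fix: `toy` is a hypothetical stand-in for customerduration_data with made-up values, and splitting on the durationCUST_hours column is an assumption about which variable is the outcome.

```r
# Hypothetical toy data standing in for customerduration_data
set.seed(6)
toy <- data.frame(
  dayofgateOUT = factor(sample(c("Monday", "Tuesday", "Wednesday"),
                               100, replace = TRUE)),
  durationCUST_hours = runif(100, min = 1, max = 100)
)

# A data frame is a list of columns: length() counts columns, not rows
length(toy)  # 2
nrow(toy)    # 100

# Run the caTools part only if the package is installed
if (requireNamespace("caTools", quietly = TRUE)) {
  # Pass one column (a vector with one element per row), not the data frame
  toy$spl <- caTools::sample.split(toy$durationCUST_hours, SplitRatio = 0.7)

  # Rows flagged TRUE form the training set, the rest the test set
  train <- subset(toy, spl == TRUE)
  test  <- subset(toy, spl == FALSE)
}
```

With the real data, the same call would take customerduration_data$durationCUST_hours as its first argument, so the returned logical vector has one entry per row (4495) and can be assigned as a new column.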
Tags: r syntax-error training-data