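[Posted]: 2020-12-30 10:47:24
[Question]:
I am trying to classify one data frame based on information in another. df1 records the measurement type at a given time (for example, whether a jar contains wet or dry soil, and whether the treatment is "None" or "ul5"). df2 records the value of X measured at a given time. For every measurement of X, I need to know the measurement type.
@Ronak Shah proposed the great solution below, but because my datasets are large I get this error: cannot allocate vector of size 56.2 Gb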
library(dplyr)

# crossing() pairs every row of df1 with every row of df2
tidyr::crossing(df1 %>% rename(Timestamp1 = Timestamp),
                df2 %>% rename(Timestamp2 = Timestamp)) %>%
  mutate(diff = as.numeric(Timestamp2 - Timestamp1)) %>%
  filter(diff > 0) %>%   # keep only df1 rows recorded before the measurement
  arrange(Jar, Timestamp2, diff) %>%
  group_by(Timestamp2) %>%
  slice(1L) %>%          # keep one row per df2 timestamp
  ungroup() %>%
  arrange(Timestamp2) %>%
  select(-diff)
Any ideas on how to merge large datasets? I have a ThinkPad with an 8th-generation Intel Core i7, so my computer is not particularly slow.
Here is df1:
df1 <- structure(list(Jar = c("Soil_dry", "Soil_dry", "soil_wet", "soil_wet",
"Soil_dry", "Soil_dry", "soil_wet"), Treatment = c("None", "None",
"None", "None", "ul5", "ul5", "ul5"), Timestamp = structure(c(1608129063,
1608129122, 1608129126, 1608129136, 1608129189, 1608129242, 1608129252
), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L), spec = structure(list(
cols = list(Jar = structure(list(), class = c("collector_character",
"collector")), Treatment = structure(list(), class = c("collector_character",
"collector")), Timestamp = structure(list(format = ""), class = c("collector_datetime",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
df2:
df2 <- structure(list(X = c(5, 3, 34, 4, 65, 9, 7), Timestamp = structure(c(1608129064,
1608129122, 1608129125, 1608129133, 1608129188, 1608129240, 1608129243
), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L), spec = structure(list(
cols = list(X = structure(list(), class = c("collector_double",
"collector")), Timestamp = structure(list(format = ""), class = c("collector_datetime",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
Desired data:
desired_data <- structure(list(X = c(5, 3, 34, 4, 65, 9, 7), Timestamp = structure(c(1608129064,
1608129122, 1608129125, 1608129133, 1608129188, 1608129240, 1608129243
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Jar = c("Soil_dry",
"Soil_dry", "Soil_dry", "soil_wet", "soil_wet", "Soil_dry", "Soil_dry"
), Treatment = c("None", "None", "None", "None", "None", "ul5",
"ul5")), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -7L), spec = structure(list(cols = list(
X = structure(list(), class = c("collector_double", "collector"
)), Timestamp = structure(list(format = ""), class = c("collector_datetime",
"collector")), Jar = structure(list(), class = c("collector_character",
"collector")), Treatment = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
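One way to avoid forming every row pair is a sorted lookup. The following is only a sketch, assuming that each measurement in df2 should take the Jar and Treatment of the most recent df1 row at or before its timestamp (the pipeline above uses strictly before; passing left.open = TRUE to findInterval() would give that behaviour). It uses base R's findInterval() on compact copies of the sample data; the names idx and result are illustrative and not part of the original code.

```r
# Compact copies of the sample data; timestamps are seconds since the epoch (UTC).
df1 <- data.frame(
  Jar = c("Soil_dry", "Soil_dry", "soil_wet", "soil_wet",
          "Soil_dry", "Soil_dry", "soil_wet"),
  Treatment = c("None", "None", "None", "None", "ul5", "ul5", "ul5"),
  Timestamp = as.POSIXct(c(1608129063, 1608129122, 1608129126, 1608129136,
                           1608129189, 1608129242, 1608129252),
                         origin = "1970-01-01", tz = "UTC")
)
df2 <- data.frame(
  X = c(5, 3, 34, 4, 65, 9, 7),
  Timestamp = as.POSIXct(c(1608129064, 1608129122, 1608129125, 1608129133,
                           1608129188, 1608129240, 1608129243),
                         origin = "1970-01-01", tz = "UTC")
)

# findInterval() needs the lookup times sorted in increasing order.
df1 <- df1[order(df1$Timestamp), ]

# For each df2 time, the index of the last df1 time that is <= it (0 if none).
idx <- findInterval(as.numeric(df2$Timestamp), as.numeric(df1$Timestamp))
idx[idx == 0] <- NA  # measurement taken before the first df1 record

# Attach the matched Jar and Treatment to each measurement.
result <- cbind(df2, df1[idx, c("Jar", "Treatment")])
rownames(result) <- NULL
result
```

On the sample data this gives the same Jar and Treatment values as the desired data above. The lookup needs memory proportional to the number of rows in each table rather than to their product.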
[Comments]:
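-
Could you include the number of rows and columns in your actual df1 and df2?
-
df1 = 225,000 obs. of 50 variables and df2 = 67,000 obs. of 410 variables.
-
The memory problem is essentially a consequence of how wide the data are. In other words, you are hitting an avoidable problem because you carry all 460 variables through the operation instead of only the six needed for the core operation. Create an index for each data frame, e.g. df1$id1 <- seq_len(nrow(df1)), then select only the columns used above. Then use the other columns as needed.
-
@Hugh, I have already tried that, but it did not solve the problem...
-
225k and 67k rows is not "large".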
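A quick back-of-the-envelope check on the row counts quoted in the comments suggests why selecting fewer columns may not be enough. This is a rough sketch, assuming the failing allocation is a single vector of 4-byte integers with one element per row pair of the cross join; that is an assumption about the internals, not something stated in the thread.

```r
rows_df1 <- 225000   # rows in df1, as reported in the comments
rows_df2 <- 67000    # rows in df2, as reported in the comments

pairs <- rows_df1 * rows_df2        # row pairs a full cross join must form
bytes_per_integer <- 4              # assumption: one 4-byte integer per pair
size_gib <- pairs * bytes_per_integer / 1024^3

pairs                # about 15 billion row pairs
round(size_gib, 1)   # 56.2
```

Under that assumption the result matches the 56.2 Gb in the error message, and it depends only on the row counts. That would be consistent with the report that dropping columns alone did not resolve the error.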