【Question title】: merge multiple dataframes based on matching timestamp
【Posted】: 2015-09-09 02:33:48
【Question】:

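I have 6 data frames, all with unique column names and the same number of columns, and the data were collected over the same time period.

Each data frame has a timestamp and minute averages, but some data frames are missing data, so the columns are not all the same length.

I would like to merge the data frames so that all 6 appear side by side, but only at times where data exist in all 6 data frames, i.e. limited by the data frame with the fewest rows, "H1_min".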

> head(H1_min)
                h1min h1temp h1humid   h1db     h1hz
1 2015-09-06 00:00:00   21.5   73.10 39.252 117.1900
2 2015-09-06 00:02:00   21.5   72.50 39.434 125.0000
3 2015-09-06 00:03:00   21.5   72.65 39.338 127.9325
4 2015-09-06 00:04:00   21.5   73.00 39.206 148.4400
5 2015-09-06 00:06:00   21.5   73.00 39.253 144.5350
6 2015-09-06 00:07:00   21.5   72.30 39.293 156.2500

The other data frames have similar column names, with h1 replaced by h2 through h6.

dput(head(H2_min))

"2015-09-08 20:21:00", "2015-09-08 20:22:00", "2015-09-08 20:23:00", 
"2015-09-08 20:24:00", "2015-09-08 20:25:00", "2015-09-08 20:26:00", 
"2015-09-08 20:27:00", "2015-09-08 20:28:00", "2015-09-08 20:29:00", 
"2015-09-08 20:30:00", "2015-09-08 20:31:00", "2015-09-08 20:32:00", 
"2015-09-08 20:33:00", "2015-09-08 20:34:00", "2015-09-08 20:35:00"
), class = "factor"), h2temp = c(23.4, 23.4, 23.3, 23.2, 23.2, 
23.1), h2humid = c(38.5, 38.3, 38.05, 38.1, 38.6, 38.6), h2db = c(38.834, 
38.655, 38.679, 38.695, 38.806, 38.702), h2hz = c(191.41, 152.34, 
162.11, 113.28, 121.09, 164.06)), .Names = c("h2min", "h2temp", 
"h2humid", "h2db", "h2hz"), row.names = c(NA, 6L), class = "data.frame")

dput(head(H4_min))

"2015-09-08 17:10:00", "2015-09-08 17:11:00", "2015-09-08 17:12:00", 
"2015-09-08 17:13:00"), class = "factor"), h4temp = c(27.2, 27.2, 
27.2, 27.2, 27.2, 27.2), h4humid = c(33.5, 33.5, 33.5, 33.5, 
33.5, 33.5), h4db = c(36.8225, 36.921, 36.8766666666667, 36.91, 
36.8336666666667, 36.768), h4hz = c(134.765, 136.068333333333, 
137.373333333333, 126.3, 139.323333333333, 128.906666666667)), .Names =       
c("h4min", "h4temp", "h4humid", "h4db", "h4hz"), row.names = c(NA, 6L), class = "data.frame")

This attempt produced the error shown after it:

H_min<-merge(H1_min, H2_min, H3_min, H4_min, H5_min, H6_min, by.x = 'row.names', by.y ='h1_min')

Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column
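
A note on the error: `merge()` for data frames joins exactly two inputs, `x` and `y`, so the additional data frames are matched to its other parameters rather than merged. The message itself comes from `by.y = 'h1_min'`, which is not a column of `H2_min` (its timestamp column is `h2min`). The following is a minimal base R sketch of one way around this, assuming the timestamp columns hold identically formatted strings. The three small data frames and the common key name `time` are invented for illustration and are not the asker's data.

```r
# Hypothetical miniature data frames (values invented for illustration)
H1_min <- data.frame(h1min  = c("2015-09-06 00:00:00", "2015-09-06 00:02:00", "2015-09-06 00:03:00"),
                     h1temp = c(21.5, 21.5, 21.5), stringsAsFactors = FALSE)
H2_min <- data.frame(h2min  = c("2015-09-06 00:00:00", "2015-09-06 00:01:00", "2015-09-06 00:03:00"),
                     h2temp = c(23.4, 23.4, 23.3), stringsAsFactors = FALSE)
H3_min <- data.frame(h3min  = c("2015-09-06 00:00:00", "2015-09-06 00:03:00", "2015-09-06 00:04:00"),
                     h3temp = c(27.2, 27.2, 27.2), stringsAsFactors = FALSE)

# Rename the first (timestamp) column of every data frame to a shared key, "time"
frames <- lapply(list(H1_min, H2_min, H3_min), function(df) {
  names(df)[1] <- "time"
  df
})

# merge() handles two data frames at a time, so fold it over the list.
# The default all = FALSE keeps only timestamps present in every data frame.
H_min <- Reduce(function(x, y) merge(x, y, by = "time"), frames)
H_min
```

Passing `all = TRUE` inside the anonymous function would instead keep every timestamp and fill the gaps with `NA`.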

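【Comments on the question】:

  • Data with spaces are hard to import. Please provide the output of dput(head(H1_min)). The same output for the additional data frames would also help.
  • Sure, added it for the second data frame.
  • @Evan That is not dput output... it should start with structure(...
  • That is the tail of the output, because the output is too large to scroll to the top. Would you like to see something else?
  • @Evan Yes, something else. The full dput output is useful; anything else is noise. If necessary, use the second argument of head to reduce the number of rows, but do not trim what dput reports.

Tags: r merge timestamp match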


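【Solution 1】:

Another approach is to convert the data.frames to xts objects and then use merge.xts(...), which merges on the timestamp automatically, and then convert the result back to a data.frame.

Most of the code below only creates reproducible example data. The actual work is in the last 6 lines.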

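# create representative example - you have this already
# format() with an explicit pattern keeps the time part on every string
# (as.character() may drop "00:00:00" at midnight in recent R versions)
time <- format(as.POSIXct("2015-09-06") + 60*(0:30), "%Y-%m-%d %H:%M:%S")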
temp = c(23.4, 23.4, 23.3, 23.2, 23.2, 23.1)
humid = c(38.5, 38.3, 38.05, 38.1, 38.6, 38.6)
db = c(38.834, 38.655, 38.679, 38.695, 38.806, 38.702)
hz = c(191.41, 152.34, 162.11, 113.28, 121.09, 164.06)
set.seed(123)   # for reproducible example
get.df <- function(n, name) {
  df <- data.frame(min=sort(sample(time,n)), 
                   temp=sample(temp,n, replace=TRUE), 
                   humid=sample(humid,n,replace=TRUE),
                   db = sample(db,n,replace=TRUE),
                   hz = sample(hz,n,replace=TRUE))
  names(df) <- paste0(name,names(df))
  df
}
H1 <- get.df(20,"h1")    # 20 rows at random times
H2 <- get.df(20,"h2")    # 20 rows at random times
H3 <- get.df(25,"h3")    # 25 rows at random times
H4 <- get.df(30,"h4")    # 30 rows at random times
# you start here
library(xts)
lst <- list(H1, H2, H3, H4)
xts.lst <- lapply(lst, function(df) xts(df[,2:ncol(df)], order.by=as.POSIXct(df[[1]])))
result <- do.call(merge.xts, c(xts.lst, all=FALSE))
result <- data.frame(result)
head(result)
#                     h1temp h1humid   h1db   h1hz h2temp h2humid   h2db   h2hz h3temp h3humid   h3db   h3hz h4temp h4humid   h4db   h4hz
# 2015-09-06 00:03:00   23.2   38.05 38.679 162.11   23.4    38.5 38.695 121.09   23.3    38.3 38.702 191.41   23.4    38.5 38.679 162.11
# 2015-09-06 00:04:00   23.1   38.05 38.655 121.09   23.4    38.3 38.679 152.34   23.2    38.1 38.679 121.09   23.1    38.3 38.834 121.09
# 2015-09-06 00:09:00   23.2   38.50 38.679 162.11   23.4    38.5 38.655 113.28   23.3    38.3 38.834 191.41   23.4    38.6 38.655 191.41
# 2015-09-06 00:12:00   23.4   38.30 38.806 164.06   23.4    38.3 38.679 164.06   23.4    38.6 38.834 162.11   23.4    38.3 38.655 121.09
# 2015-09-06 00:13:00   23.4   38.60 38.679 152.34   23.2    38.6 38.655 164.06   23.3    38.6 38.679 162.11   23.4    38.5 38.679 121.09
# 2015-09-06 00:14:00   23.1   38.50 38.806 191.41   23.2    38.6 38.695 152.34   23.4    38.6 38.834 162.11   23.3    38.5 38.834 191.41
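
To make the role of the `all` argument in the merge call above concrete, here is a small sketch that contrasts `all = FALSE` with `all = TRUE` on two tiny xts objects. The objects `a` and `b`, their values, and the `tz = "UTC"` setting are assumptions made for illustration only.

```r
library(xts)

# Hypothetical two-sensor example (values invented for illustration)
t0 <- as.POSIXct("2015-09-06 00:00:00", tz = "UTC")
a <- xts(cbind(h1temp = c(21.5, 21.6, 21.7)), order.by = t0 + 60 * c(0, 1, 2))
b <- xts(cbind(h2temp = c(23.4, 23.3)),       order.by = t0 + 60 * c(1, 3))

# all = FALSE: keep only timestamps present in both series
inner <- merge(a, b, all = FALSE)

# all = TRUE: keep every timestamp; missing readings become NA
outer <- merge(a, b, all = TRUE)

inner
outer
```

With `all = FALSE` only the 00:01 row survives, while `all = TRUE` keeps all four timestamps and marks the missing readings with `NA`.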

【Comments】:

  • Thanks for the reply! I actually prefer c(xts.lst, all=TRUE) because it shows the gaps where a sensor failed. Thanks again!
【Solution 2】:
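library(dplyr)
library(magrittr)
library(tidyr)

# two-row stand-in for the real data
# (tibble() replaces the deprecated data_frame())
H1_min = 
  tibble(
    h1min = c("2015-09-06 00:00:00", "2015-09-06 00:02:00"),
    h1temp = c(21.5, 21.5),
    h1humid = c(73.10, 72.50),
    h1db = c(39.252, 39.434),
    h1hz = c(117.1900, 125.000) )

# second data frame: same columns, with one missing reading
H2_min = H1_min %>% mutate(h1hz = c(117.1900, NA))

# gather()/spread() still work but are superseded by pivot_longer()/pivot_wider()
answer = 
  list(H1_min, H2_min) %>%
  # give every data frame the same column names
  lapply(. %>% setNames(c("min",
                          "temp",
                          "humid",
                          "db",
                          "hz"))) %>%
  # stack them, recording the list position ("1", "2", ...) as location
  bind_rows(.id = "location") %>%
  # long format: one row per timestamp, location and variable
  gather(variable, value, -location, -min) %>%
  # rebuild column names such as h1temp, h2temp
  mutate(prefix = "h") %>%
  unite(new_variable, prefix, location, variable, sep = "") %>%
  # wide format again: one row per timestamp
  spread(new_variable, value) %>%
  # keep only timestamps with data in every column
  filter(complete.cases(.))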

【Comments】:

【Solution 3】:

A simpler way to solve this, based on @jlhoward's answer.

    qxts1 <- xts(df1[,-1], order.by = df1[,1]) 
    qxts2 <- xts(df2[,-1], order.by = df2[,1])
    
    xts.lst = list(qxts1, qxts2)
    result <- do.call(merge.xts, c(xts.lst, all=FALSE))
    result <- data.frame(result)
    

For xts or zoo, make sure your timestamp column is a time-based class such as Date, POSIXct, chron, ...
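
The dput() output in the question shows the timestamp columns stored as factors, so they would need converting before being passed as order.by. The snippet below is a minimal sketch of that conversion. The sample values, the format string, and the `tz = "UTC"` setting are assumptions and should be adjusted to the real data.

```r
# Timestamps stored as a factor, as in the dput() output (values are illustrative)
h2min <- factor(c("2015-09-08 20:21:00", "2015-09-08 20:22:00", "2015-09-08 20:23:00"))

# Convert via character so the labels, not the integer codes, are parsed.
# The time zone is an assumption; use the one the data were recorded in.
stamps <- as.POSIXct(as.character(h2min), format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

class(stamps)   # the result is a POSIXct vector
diff(stamps)    # spacing between consecutive readings
```

A vector converted this way can then be supplied as `order.by`, for example `xts(df1[,-1], order.by = stamps)`.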

【Comments】:
