【Question title】: Merging data sets based on Timestamp
【Posted】: 2021-03-28 22:20:48
【Question】:

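I have two data sets with timestamps that I want to merge. For each value in column X of df2, I want to get the corresponding values of Jar and Treatment from df1. From df1 I can see which Jar was measured at a given time and what the Treatment was. In df2 I can see the value of X at a given time, and for each value of X I need to know which Jar (and which Treatment) was being measured.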

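I made some attempts at merging the two, but they did not work because there are gaps in the time series. For example, df2 has a value of X at timestamp 2020-12-16 14:31:05, but this timestamp does not exist in df1. Based on df1, however, I know that at this timestamp Jar = Soil_dry and Treatment = None.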

Any suggestions on how to build a table that gives, for each value of X in df2, the Jar and Treatment values from df1?

Here is df1:

df1 <- structure(list(Jar = c("Soil_dry", "Soil_dry", "Soil_dry", "Soil_dry", 
"Soil_dry", "Soil_dry", "Soil_dry", "Soil_dry", "Soil_dry", "Soil_dry", 
"Soil_dry", "Soil_dry", "Soil_dry", "Soil_dry", "Soil_dry", "Soil_dry", 
"Soil_dry", "Soil_dry", "Soil_dry", "soil_wet", "soil_wet", "soil_wet", 
"soil_wet", "soil_wet", "soil_wet", "soil_wet", "soil_wet", "soil_wet", 
"soil_wet", "soil_wet", "soil_wet", "soil_wet", "soil_wet", "soil_wet", 
"soil_wet", "soil_wet", "soil_wet", "soil_wet", "soil_wet", "Soil_dry", 
"Soil_dry", "Soil_dry", "Soil_dry", "Soil_dry", "Soil_dry", "Soil_dry", 
"Soil_dry", "Soil_dry", "Soil_dry", "Soil_dry", "Soil_dry", "Soil_dry", 
"Soil_dry", "soil_wet", "soil_wet", "soil_wet", "soil_wet", "soil_wet", 
"soil_wet", "soil_wet", "soil_wet", "soil_wet", "soil_wet", "soil_wet", 
"soil_wet", "soil_wet", "soil_wet", "soil_wet"), Treatment = c("None", 
"None", "None", "None", "None", "None", "None", "None", "None", 
"None", "None", "None", "None", "None", "None", "None", "None", 
"None", "None", "None", "None", "None", "None", "None", "None", 
"None", "None", "None", "None", "None", "None", "None", "None", 
"None", "None", "None", "None", "None", "None", "ul5", "ul5", 
"ul5", "ul5", "ul5", "ul5", "ul5", "ul5", "ul5", "ul5", "ul5", 
"ul5", "ul5", "ul5", "ul5", "ul5", "ul5", "ul5", "ul5", "ul5", 
"ul5", "ul5", "ul5", "ul5", "ul5", "ul5", "ul5", "ul5", "ul5"
), Timestamp = structure(c(1608128877, 1608128933, 1608128997, 
1608129058, 1608129063, 1608129112, 1608129117, 1608129122, 1608129127, 
1608129238, 1608129243, 1608129348, 1608129353, 1608129358, 1608129363, 
1608129368, 1608129373, 1608129473, 1608129478, 1608129483, 1608129488, 
1608129598, 1608129603, 1608129717, 1608129723, 1608129837, 1608129842, 
1608129957, 1608129962, 1608130072, 1608130077, 1608130082, 1608130192, 
1608130197, 1608130202, 1608130318, 1608130323, 1608130418, 1608130423, 
1608130428, 1608130492, 1608130497, 1608130502, 1608130507, 1608130612, 
1608130617, 1608130622, 1608130627, 1608130732, 1608130737, 1608130742, 
1608130747, 1608130853, 1608130858, 1608130863, 1608130978, 1608130983, 
1608131093, 1608131098, 1608131103, 1608131213, 1608131218, 1608131223, 
1608131337, 1608131343, 1608131457, 1608131462, 1608131467), class = c("POSIXct", 
"POSIXt"), tzone = "UTC")), row.names = c(NA, -68L), class = "data.frame")

df2:

df2 <- structure(list(X = c(5L, 3L, 34L, 4L, 65L, 5L, 7L, 5L, 8L, 9L, 
8L, 5L, 78L, 9L, 5L, 78L, 9L, 5L, 78L, 9L, 5L, 7L, 4L, 34L, 8L, 
5L, 4L, 9L, 78L, 59L, 5L, 5L, 6L, 3L, 3L, 7L, 5L, 47L, 2L, 67L, 
34L, 76L, 5L, 76L, 5L, 6L, 5L, 7L, 2L, 13L, 1L, 54L, 32L, 4L, 
3L, 45L, 1L, 1L), Timestamp = structure(c(1608129065, 1608129122, 
1608129127, 1608129238, 1608129263, 1608129288, 1608129353, 1608129358, 
1608129363, 1608129368, 1608129373, 1608129473, 1608129478, 1608129483, 
1608129488, 1608129598, 1608129663, 1608129717, 1608129723, 1608129831, 
1608129842, 1608129957, 1608129962, 1608130072, 1608130073, 1608130082, 
1608130132, 1608130197, 1608130202, 1608130318, 1608130323, 1608130418, 
1608130423, 1608130428, 1608130492, 1608130497, 1608130502, 1608130507, 
1608130612, 1608130617, 1608130622, 1608130627, 1608130732, 1608130737, 
1608130742, 1608130747, 1608130853, 1608130858, 1608130863, 1608130978, 
1608130983, 1608131093, 1608131098, 1608131103, 1608131213, 1608131218, 
1608131223, 1608131337), tzone = "UTC", class = c("POSIXct", 
"POSIXt"))), row.names = c(NA, -58L), class = "data.frame")

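【Comments】:

  • Since df1 and df2 only have the "Timestamp" column in common, have you tried merging the two on "Timestamp"?
  • Does merge(df1, df2) give you what you want?
  • The problem with merge(df1, df2) is the missing timestamps. For example, df2 has a value of X at 2020-12-16 14:27:52, but because this timestamp does not exist in df1, that X measurement is absent from the result of merge(df1, df2).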
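
The behavior described in the last comment can be reproduced on a small toy example. The sketch below uses made-up data frames `a` and `b` as illustrative stand-ins for df1 and df2 (they are not the data from the question) and relies only on base R's `merge()`, whose default is an inner join on the shared column.

```r
# Toy stand-ins for df1 (a) and df2 (b); values are made up for illustration.
a <- data.frame(
  Jar = c("Soil_dry", "Soil_dry", "soil_wet"),
  Timestamp = as.POSIXct(c("2020-12-16 14:31:03",
                           "2020-12-16 14:31:52",
                           "2020-12-16 14:38:03"), tz = "UTC"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  X = c(5L, 3L),
  Timestamp = as.POSIXct(c("2020-12-16 14:31:05",
                           "2020-12-16 14:31:52"), tz = "UTC")
)

# Default merge() is an inner join: only exact timestamp matches survive,
# so the X measured at 14:31:05 disappears.
inner <- merge(a, b, by = "Timestamp")

# all.y = TRUE keeps every row of b, but the unmatched row gets NA for Jar.
keep_b <- merge(a, b, by = "Timestamp", all.y = TRUE)
```

Here `inner` keeps only the row whose timestamp appears in both frames, while `keep_b` keeps both X values but leaves Jar missing for the unmatched one. Neither gives the "Jar in effect at that time", which is what the question asks for.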

Tags: r tidyverse


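【Answer 1】:

Why doesn't df3 <- df1 %>% full_join(df2, by = 'Timestamp') produce the result you want?

Also, you say

"When I see a specific value of X, I need to know which Jar/Treatment was measured."

In some cases you have several Timestamp values for a given value of X. In other words, you cannot avoid getting multiple Jar/Treatment values for each value of X.

Example: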

Timestamp            Jar       Treatment  X.df1  X.df2
2020-12-16 15:03:03  soil_wet  ul5        1      1
2020-12-16 15:07:03  soil_wet  ul5        1      1
2020-12-16 15:08:57  soil_wet  ul5        1      1
2020-12-16 14:56:52  Soil_dry  ul5        2      2
2020-12-16 15:01:03  soil_wet  ul5        2      2
2020-12-16 14:32:02  Soil_dry  None       3      3
2020-12-16 14:53:48  Soil_dry  ul5        3      3
2020-12-16 14:54:52  Soil_dry  ul5        3      3
2020-12-16 15:06:53  soil_wet  ul5        3      3
2020-12-16 15:11:02  soil_wet  ul5        3      3

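【Comments】:

  • Thanks for your comment, @Kristian Jr.! With your solution, not all X values from df2 are in the final table. Getting multiple Jar/Treatment values for each X value is not a problem.
  • @Tiptop, full_join does exactly what you describe: you will get all values of X. Try reducing your data frames to a few values, some where you know X will match several Jar/Treatment values and some where you know X will not match. That way you will see that all values are present, matched or not. :) Or perhaps I don't understand the problem. Try making an example of the result df you want.
  • Really? I will check tomorrow when I have my laptop. Yes, I will get all values of X. But in some rows I will get a value for X while Jar/Treatment is NA. For example, df2 has a value of X at timestamp 2020-12-16 14:31:05, but this timestamp does not exist in df1, so I will get NA for Jar/Treatment. Based on df1, however, I know that at this date and time Jar = Soil_dry and Treatment = None.
  • But I will check on a small data set tomorrow, as you suggest!
  • I think this does work: left_join(df2, df1) %>% fill(Jar, Treatment)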
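
One caveat about the `left_join(df2, df1) %>% fill(Jar, Treatment)` idea in the last comment: `fill()` can only carry forward values that are already present in the joined table, so an unmatched df2 row is filled from the previous df2 row that happened to match, not necessarily from the most recent df1 measurement. A possible variant, sketched below on made-up toy data (`a` and `b` are illustrative stand-ins for df1 and df2), is to do a full join, sort by time, fill downward, and then keep only the rows that carry an X value. It assumes dplyr and tidyr are installed, that X itself has no missing values, and that "the Jar in effect" means the last df1 row at or before the timestamp.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for df1 (a) and df2 (b); values are made up for illustration.
a <- data.frame(
  Jar = c("Soil_dry", "soil_wet"),
  Treatment = c("None", "ul5"),
  Timestamp = as.POSIXct(c("2020-12-16 14:31:03",
                           "2020-12-16 14:38:03"), tz = "UTC"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  X = c(5L, 65L, 7L),
  Timestamp = as.POSIXct(c("2020-12-16 14:31:05",
                           "2020-12-16 14:34:23",
                           "2020-12-16 14:38:03"), tz = "UTC")
)

result <- full_join(b, a, by = "Timestamp") %>%  # keep rows from both frames
  arrange(Timestamp) %>%                         # put everything in time order
  fill(Jar, Treatment, .direction = "down") %>%  # carry the last known Jar forward
  filter(!is.na(X))                              # drop rows that came only from a
```

In this toy case the two X values measured between the df1 timestamps inherit Soil_dry/None, and the one that matches exactly gets soil_wet/ul5. An X measured before the first df1 timestamp would still come back as NA, because there is nothing earlier to carry forward.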
【Answer 2】:

You can achieve this with merge (as mentioned above) or with dplyr:

left_join(df2, df1, "Timestamp")
#    X           Timestamp      Jar Treatment
# 1  34 2020-12-16 14:27:52     <NA>      <NA>
# 2   5 2020-12-16 14:28:54     <NA>      <NA>
# 3   5 2020-12-16 14:22:57     <NA>      <NA>
# 4  24 2020-12-16 14:30:34     <NA>      <NA>
# 5  45 2020-12-16 14:31:03 Soil_dry      None
# 6  66 2020-12-16 14:31:52 Soil_dry      None

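【Comments】:

  • Thanks for your comment, @Ben. I think I did a good job describing my problem. The issue with the join solution is that I also want to know what Jar and Treatment were when X is 34, 5, 5, 24 in rows 1:4 of the table you produced.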
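
For rows that an exact join leaves as NA, one option is to skip exact matching and instead look up, for each df2 timestamp, the most recent df1 row at or before it. The following is a minimal base R sketch using `findInterval()` on made-up toy data; the frames `a` and `b` and the helper name `lookup_last` are hypothetical and only for illustration. It assumes a measurement belongs to the last Jar recorded at or before its timestamp.

```r
# Hypothetical helper: for each time in `when`, return the index of the latest
# entry in `ref_time` that is <= that time (NA if there is no such entry).
# `ref_time` must be sorted in increasing order.
lookup_last <- function(when, ref_time) {
  idx <- findInterval(as.numeric(when), as.numeric(ref_time))
  idx[idx == 0L] <- NA_integer_  # 0 means "earlier than the first reference time"
  idx
}

# Toy stand-ins for df1 (a) and df2 (b); values are made up for illustration.
a <- data.frame(
  Jar = c("Soil_dry", "soil_wet"),
  Treatment = c("None", "ul5"),
  Timestamp = as.POSIXct(c("2020-12-16 14:31:03",
                           "2020-12-16 14:38:03"), tz = "UTC"),
  stringsAsFactors = FALSE
)
a <- a[order(a$Timestamp), ]  # findInterval() needs sorted reference times

b <- data.frame(
  X = c(34L, 5L, 7L),
  Timestamp = as.POSIXct(c("2020-12-16 14:27:52",
                           "2020-12-16 14:31:05",
                           "2020-12-16 14:38:03"), tz = "UTC")
)

idx <- lookup_last(b$Timestamp, a$Timestamp)
result <- cbind(b, a[idx, c("Jar", "Treatment")])
rownames(result) <- NULL
```

With this toy input, the X at 14:31:05 picks up Soil_dry/None from the 14:31:03 row, while the X at 14:27:52 stays NA because nothing in `a` precedes it. Whether such early rows should stay NA or take the first available Jar is a choice the data owner has to make.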