【发布时间】:2019-03-16 19:14:34
【问题描述】:
我发现一个问题很难用 MRE 来解释 回答的方式,主要是因为我不完全理解问题出在哪里 我。所以我很抱歉前言含糊。
我有一个包含许多样本和参考测量值的小标题,我想要
为每个样本做一些线性插值。我现在通过取出来做到这一点
所有参考测量值,使用
approx,然后把它补回来。但是因为我先把它拿出来,我
不能以 group_by dplyr 管道方式很好地做到这一点。现在我用
非常丑陋的解决方法,我将新创建的空(NA)列添加到
示例 tibble,然后使用 for 循环进行。
所以我的问题是:如何在组内实现 approx 部分
进入管道,以便我可以在组内做所有事情?我实验过
使用dplyr::do(),并在“使用 dplyr 编程”中遇到了小插图,但是
搜索主要给了我broom::augment 和lm 我认为有效的东西
不同...(例如,见
Using approx() with groups in dplyr)。这个线程似乎也很有希望:How do you use approx() inside of mutate_at()?
irc 上有人建议使用条件突变,使用case_when,但我
还不完全了解在这种情况下的位置和方式。
我认为问题在于我想过滤掉部分数据 对于以下 mutate 操作,但 mutate 操作依赖于 我刚刚过滤掉的分组数据,如果这有意义的话。
这是一个 MWE:
library(tidyverse) # or just dplyr, tibble
# create fake data
data <- data.frame(
# in reality a dttm with the measurement time
timestamp = c(rep("a", 7), rep("b", 7), rep("c", 7)),
# measurement cycle, normally 40 for sample, 41 for reference
cycle = rep(c(rep(1:3, 2), 4), 3),
# wheather the measurement is a reference or a sample
isref = rep(c(rep(FALSE, 3), rep(TRUE, 4)), 3),
# measurement intensity for mass 44
r44 = c(28:26, 30:26, 36, 33, 31, 38, 34, 33, 31, 18, 16, 15, 19, 18, 17)) %>%
# measurement intensity for mass 45, normally also masses up to mass 49
mutate(r45 = r44 + rnorm(21, 20))
# of course this could be tidied up to "intensity" with a new column "mass"
# (44, 45, ...), but that would make making comparisons even harder...
# overview plot
data %>%
ggplot(aes(x = cycle, y = r44, colour = isref)) +
geom_line() +
geom_line(aes(y = r45), linetype = 2) +
geom_point() +
geom_point(aes(y = r45), shape = 1) +
facet_grid(~ timestamp)
# what I would like to do
data %>%
group_by(timestamp) %>%
do(target_cycle = approx(x = data %>% filter(isref) %>% pull(r44),
y = data %>% filter(isref) %>% pull(cycle),
xout = data %>% filter(!isref) %>% pull(r44))$y) %>%
unnest()
# immediately append this new column to the original dataframe for all the
# samples (!isref) and then apply another approx for those values.
# here's my current attempt for one of the timestamps
matchref <- function(dat) {
# split the data into sample gas and reference gas
ref <- filter(dat, isref)
smp <- filter(dat, !isref)
# calculate the "target cycle", the points at which the reference intensity
# 44 matches the sample intensity 44 with linear interpolation
target_cycle <- approx(x = ref$r44,
y = ref$cycle, xout = smp$r44)
# append the target cycle to the sample gas
smp <- smp %>%
group_by(timestamp) %>%
mutate(target = target_cycle$y)
# linearly interpolate each reference gas to the target cycle
ref <- ref %>%
group_by(timestamp) %>%
# this is needed because the reference has one more cycle
mutate(target = c(target_cycle$y, NA)) %>%
# filter out all the failed ones (no interpolation possible)
filter(!is.na(target)) %>%
# calculate interpolated value based on r44 interpolation (i.e., don't
# actually interpolate this value but shift it based on the 44
# interpolation)
mutate(r44 = approx(x = cycle, y = r44, xout = target)$y,
r45 = approx(x = cycle, y = r45, xout = target)$y) %>%
select(timestamp, target, r44:r45)
# add new reference gas intensities to the correct sample gasses by the target cycle
left_join(smp, ref, by = c("time", "target"))
}
matchref(data)
# and because now "target" must be length 3 (the group size) or one, not 9
# I have to create this ugly for-loop
# for which I create a copy of data that has the new columns to be created
mr <- data %>%
# filter the sample gasses (since we convert ref to sample)
filter(!isref) %>%
# add empty new columns
mutate(target = NA, r44 = NA, r45 = NA)
# apply matchref for each group timestamp
for (grp in unique(data$timestamp)) {
mr[mr$timestamp == grp, ] <- matchref(data %>% filter(timestamp == grp))
}
【问题讨论】:
-
当样本值超出参考范围时会发生什么?例如,在
timestampa中,引用范围从 27 到 30,但您的值是r44,即 26。应该外推还是返回 `NA? -
我认为它应该返回 NA。否则我可能会使用
Hmisc::approxExtrap。
标签: r dplyr linear-interpolation