【Question Title】: BY operation on data.table without aggregation
【Posted】: 2020-06-10 23:33:04
【Question Description】:

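I want to share this problem because it took me a lot of time and trouble to solve. I have a data.table full of noisy EEG signals that I want to band-pass filter before plotting. I put the signals of all participants, along with their many within-subject factors, into a single R data.table.

My dataset has

x = time
y = value
participant name = p_name
factor 1 and factor 2

The catch is that I want to go straight from data.table to ggplot2, without having to use 6 nested for loops to band-pass all the data across all conditions.

I came up with a solution, which consists of "forgetting" the factor/dimension to operate along. So, instead of the classic syntax: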

DF[, m := mean(value),
    by = .(p_name, factor1, factor2, time)]

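Leaving the time factor out of by lets the operation (a frequency filter in my case) run over all values across time within each group. The function returns a vector of the same length as its input, so each result is assigned straight back to its own row.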

# my_func(x) returns x with the high frequencies filtered out
DF[, value_filt := my_func(value),
    by = .(p_name, factor1, factor2)] # <- time not in the list

Nice bonus: the time factor is not even lost in the process.
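The behavior described above can be checked on a toy table. This is only a minimal sketch: the data are made up, and `my_func` here is a hypothetical stand-in (it merely centers each series) for whatever length-preserving filter is actually used.

```r
library(data.table)

# Toy long table (made-up data): 2 participants x 2 conditions x 5 time points
DF <- CJ(p_name = c("p1", "p2"), factor1 = c("left", "right"), time = 1:5)
DF[, value := as.numeric(seq_len(.N))]

# Hypothetical stand-in for a filter: any function that returns a vector
# of the same length as its input behaves the same way with :=
my_func <- function(x) x - mean(x)

n_before <- nrow(DF)

# 'time' is left out of 'by', so my_func receives the whole
# time series of each (p_name, factor1) group at once
DF[, value_filt := my_func(value), by = .(p_name, factor1)]

nrow(DF) == n_before     # the row count is unchanged
"time" %in% names(DF)    # the time column is still there
```

Because `:=` adds the column by reference, nothing is collapsed: each group's result is written back row by row, which is why the time column survives.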

Is there a better/faster solution?

Reproducible code

library(data.table)
library(signal)
library(ggplot2)

# Operations on data.table without aggregation

set.seed(1234)
fs = 128 # Hz sampling rate (=length of 1 sec vector)
tseq <- seq(0, .999, by = 1/fs) # t = 128 samples for 1 second 

# To generate a 128-sample signal, I created a function
# that mixes together two sine waves + noise
generate_sig = function(t) {
  # two sinusoids with random frequencies, each with its own Gaussian noise
  x <- sin(rnorm(1)*40*pi*t*.5) + 0.11*rnorm(length(t)) +
       sin(rnorm(1)*40*pi*t*.5) + 0.31*rnorm(length(t))
  return(x)
}

# Testing the function
x = generate_sig(tseq)
plot(NA,NA,xlim=c(0,128),ylim=c(-pi,pi),xlab='t',ylab='signal amplitude')
lines(x,col='red')

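# Generating a 2nd-order Butterworth band-pass filter (1-15 Hz);
# cutoffs are normalized by the Nyquist frequency fs/2
b = butter(2, c(1,15)*(2/fs), type = "pass")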

# Applying the filter
xfil = filtfilt(b,x)
# Plotting
lines(xfil,col='black')

Generating the data.table data

One long table containing the signals of two participants under the factor 1 and factor 2 conditions

val_pname=c('p1', 'p2')
val_factor1=c('left','right')
val_factor2=c('pain', 'reward', 'sham')
nb_samples = length(tseq)
col_pname = factor(rep(c(val_pname),each=length(val_factor1)*length(val_factor2)*nb_samples))
col_factor1 = factor(rep(rep(c(val_factor1),each=length(val_factor2)*nb_samples),length(val_pname)))
col_factor2 = factor(rep(rep(rep(c(val_factor2),each=nb_samples),length(val_factor1)),length(val_pname)))
col_t= rep(rep(rep(tseq,length(val_factor2)),length(val_factor1)),length(val_pname))
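# sample index within each signal; its definition is missing from the post,
# so it is assumed here to be 1..nb_samples
col_t_idx = rep(seq_len(nb_samples),length(val_factor2)*length(val_factor1)*length(val_pname))
# one column per signal; as.numeric() flattens the matrix column by column
col_values = replicate(length(val_factor2)*length(val_factor1)*length(val_pname),generate_sig(tseq))
col_values = as.numeric(col_values)
df = data.table(participant=col_pname,factor1=col_factor1,factor2=col_factor2,t=col_t,t_idx=col_t_idx,val=col_values)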

# visualizing the whole data table
ggplot(df,aes(x=t, y=val, color=factor1))+
  geom_line()+
  facet_grid(factor2~participant)+
  theme_bw()

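Now the main problem:

I want to band-pass filter my signals directly in the data table, without using for loops. In practice I have about 6 different factors, and I want to be able to aggregate (average) over any of them, as I see fit, before band-pass filtering.

Solution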
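The "average first, then filter" workflow described here can be chained in two data.table calls. The following is a sketch under stated assumptions, not the poster's code: the table `toy` and its 5 Hz test signal are invented for illustration, and the filter settings simply mirror the ones used earlier in the post.

```r
library(data.table)
library(signal)

fs   <- 128                          # Hz, sampling rate
tseq <- seq(0, .999, by = 1/fs)      # 128 samples for 1 second
set.seed(1)

# Made-up long table: 2 participants x 2 sides x 3 conditions x 128 samples
toy <- CJ(participant = c("p1", "p2"),
          factor1     = c("left", "right"),
          factor2     = c("pain", "reward", "sham"),
          t           = tseq)
toy[, val := sin(2*pi*5*t) + rnorm(.N, sd = 0.3)]

# 2nd-order Butterworth band-pass, 1-15 Hz
b <- butter(2, c(1, 15) * (2/fs), type = "pass")

# Step 1: average over factor2 by leaving it out of 'by' (t stays in,
# so there is one averaged value per time point)
avg <- toy[, .(val = mean(val)), by = .(participant, factor1, t)]

# Step 2: filter each averaged series by leaving t out of 'by'
avg[, val_filt := filtfilt(b, val), by = .(participant, factor1)]
```

The rule of thumb is the same in both steps: whatever is missing from `by` is the dimension the function runs along. Step 1 collapses `factor2`, step 2 runs along `t` without collapsing it.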

# I managed to work it out this way
df[,val2:=filtfilt(b,val), by=.(participant,factor1,factor2)]

There is no aggregation over the time factor 't', so the output table has the same size as the input table.

Visualizing the solution

# visualizing the filtered data table 
ggplot(df,aes(x=t, y=val2, color=factor1))+
  geom_line()+
  facet_grid(factor2~participant)+
  theme_bw()

【Question Comments】:

Tags: r data.table signal-processing aggregation bandpass-filter

【Solution 1】:

Without modifying the original dataset, and keeping the time dimension:

filtered_df = df[, .(val = filtfilt(b, val), t = t), by = .(participant, factor1, factor2)]
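For comparison, here is a small sketch of two ways to leave the original table untouched. It uses invented data and a hypothetical stand-in function `center` in place of the filter; the `copy()` variant is an alternative suggested here, not part of the answer.

```r
library(data.table)

# Made-up table: 2 groups x 4 time points
df_toy <- data.table(g   = rep(c("a", "b"), each = 4),
                     t   = rep(1:4, 2),
                     val = as.numeric(1:8))

# Hypothetical stand-in for the filter (same length in, same length out)
center <- function(x) x - mean(x)

# Variant 1: j = .(...) builds a new table; only the listed columns
# (plus the 'by' columns) are kept, and df_toy is not modified
new_dt <- df_toy[, .(t = t, val = center(val)), by = g]

# Variant 2: copy() first, then := on the copy; every original column
# is kept and df_toy is still not modified
new_dt2 <- copy(df_toy)[, val_filt := center(val), by = g]

names(df_toy)   # still only g, t, val
```

Variant 1 needs every column to be listed by hand, which gets tedious with many factors; variant 2 costs one full copy of the table but carries all columns along automatically.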

【Discussion】:
