【Question title】: Filling a matrix efficiently
【Posted】: 2016-11-06 08:48:57
【Question description】:

My data are outcomes measured for a set of subjects treated with a set of drugs across a set of hospitals (#drugs > #subjects > #hospitals).

subjects <- paste("S",1:100,sep="_")
drugs <- paste("D",1:1000,sep="_")

Each row of my data.frame holds one drug, subject, hospital, outcome combination:

df <- expand.grid(subject=subjects,drug=drugs,stringsAsFactors=F)
hospitals <- paste("H",1:10,sep="_")
df$hospital <- rep(sapply(hospitals,function(h) rep(h,10)),200)
set.seed(1)
df$outcome <- runif(nrow(df),0,100)

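Now I want to build a matrix in which each row is a unique hospital subject combination and each column is a unique hospital drug combination. Here is one way to build it, probably not the most efficient: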

df$hospital.subject <- paste(df$hospital,df$subject,sep=":")
df$hospital.drug <- paste(df$hospital,df$drug,sep=":")

hospital.subject <- unique(paste(df$hospital,df$subject,sep=":"))
hospital.drug <- unique(paste(df$hospital,df$drug,sep=":"))

mat <- do.call(rbind,lapply(hospital.subject, function(x){
  hospital.subject.df <- dplyr::filter(df,hospital.subject==x)
  res <- rep(NA,length(hospital.drug))
  match.idx <- match(hospital.drug,hospital.subject.df$hospital.drug)
  res[which(!is.na(match.idx))] <- hospital.subject.df$outcome[match.idx[which(!is.na(match.idx))]]
  return(res)
}))
rownames(mat) <- hospital.subject
colnames(mat) <- hospital.drug

So question #1 is: how can this matrix be built more efficiently, if that is possible?
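
One possible way to avoid the per-row `lapply`/`filter` pass is to preallocate the matrix and fill it in a single step with two-column matrix indexing. The following is only a minimal sketch on a small toy data set: `fill_matrix`, `toy.df` and `toy.mat` are hypothetical names that do not appear in the question, and the sketch assumes each hospital.subject / hospital.drug pair occurs at most once.

```r
# Hypothetical helper: build the wide matrix with two-column matrix indexing.
# Assumption: each (row.key, col.key) pair appears at most once; if a pair
# repeats, a later value overwrites an earlier one.
fill_matrix <- function(row.key, col.key, value) {
  rows <- unique(row.key)
  cols <- unique(col.key)
  m <- matrix(NA_real_, nrow = length(rows), ncol = length(cols),
              dimnames = list(rows, cols))
  # cbind() of row and column positions addresses one cell per observation
  m[cbind(match(row.key, rows), match(col.key, cols))] <- value
  m
}

# Toy data with the same layout as the question, only smaller:
# 20 subjects, 5 drugs, 2 hospitals with 10 subjects each
toy.df <- expand.grid(subject = paste("S", 1:20, sep = "_"),
                      drug = paste("D", 1:5, sep = "_"),
                      stringsAsFactors = FALSE)
toy.df$hospital <- rep(rep(paste("H", 1:2, sep = "_"), each = 10), times = 5)
set.seed(1)
toy.df$outcome <- runif(nrow(toy.df), 0, 100)

toy.mat <- fill_matrix(paste(toy.df$hospital, toy.df$subject, sep = ":"),
                       paste(toy.df$hospital, toy.df$drug, sep = ":"),
                       toy.df$outcome)
dim(toy.mat)
```

Row and column order follow first appearance in the data, which is the same order that `unique()` produces in the question's own code.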

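Now, because the matrix is sparse, I want to impute the missing values of each hospital.subject combination, that is, the hospital.drug combinations in which those subjects were not observed. The imputed values should be drawn from a normal distribution with mean = median and sd = mad of the outcomes that were observed for those hospital.subject combinations under the corresponding drug.

In other words, for subjects[1:10], which were observed only in hospitals[1], fill in hospitals[2:10] for each drug from the values observed in hospitals[1]. That means: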

mat[1:10,2:10] <- rnorm(90,median(mat[1:10,1]),mad(mat[1:10,1]))

mat[1:10,12:20] <- rnorm(90,median(mat[1:10,1]),mad(mat[1:10,1]))

And so on for the following hospitals (rows in mat), for example:

mat[31:40,2:10] <- rnorm(90,median(mat[31:40,1]),mad(mat[31:40,1]))

mat[31:40,12:20] <- rnorm(90,median(mat[31:40,1]),mad(mat[31:40,1]))

With for loops I would do it like this:

for(h in seq_along(hospitals)){
  # rows and columns that belong to hospital h
  row.idx <- which(grepl(paste0(hospitals[h],":"),hospital.subject))
  col.idx <- which(grepl(paste0(hospitals[h],":"),hospital.drug))
  for(i in seq_along(col.idx)){
    drug <- strsplit(hospital.drug[col.idx[i]],split=":")[[1]][2]
    # columns of the same drug in the other hospitals; setdiff() drops the observed column by value, not by position
    impute.idx <- setdiff(which(grepl(paste0(":",drug,"$"),hospital.drug,perl=TRUE)),col.idx[i])
    mat[row.idx,impute.idx] <- rnorm(length(row.idx)*length(impute.idx),mean=median(mat[row.idx,col.idx[i]]),sd=mad(mat[row.idx,col.idx[i]]))
  }
}

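Is there a more efficient and more elegant way to achieve this?

One more thing: my real data are not as tidy as this example, because the number of subjects is not the same in every hospital, and some subjects were treated with the same drug in more than one hospital.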
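
The imputation described above could be restructured so that the hospital and drug of every row and column are parsed once, up front, instead of calling `grepl` inside the loops. This is a hedged sketch rather than a definitive solution: `impute_blocks` and the toy matrix are hypothetical, the dimnames are assumed to have the form `hospital:subject` and `hospital:drug`, and the sketch pools all values observed for a drug within a hospital's rows, which is one possible reading of the uneven real-data case.

```r
# Hypothetical helper: for each hospital (row group) x drug (column group)
# block, fill the NA cells with draws from a normal distribution whose mean
# is the median and whose sd is the MAD of the values observed in that block.
# Assumption: dimnames look like "hospital:subject" and "hospital:drug".
impute_blocks <- function(m) {
  row.hosp <- sub(":.*$", "", rownames(m))   # hospital of each row
  col.drug <- sub("^.*:", "", colnames(m))   # drug of each column
  rows.by.hosp <- split(seq_len(nrow(m)), row.hosp)
  cols.by.drug <- split(seq_len(ncol(m)), col.drug)
  for (r in rows.by.hosp) {
    for (cl in cols.by.drug) {
      block <- m[r, cl, drop = FALSE]
      miss <- is.na(block)
      obs <- block[!miss]
      # skip blocks with nothing observed or nothing missing
      if (length(obs) == 0 || !any(miss)) next
      block[miss] <- rnorm(sum(miss), mean = median(obs), sd = mad(obs))
      m[r, cl] <- block
    }
  }
  m
}

# Toy matrix: 2 hospitals with 3 subjects each, 2 drugs
set.seed(1)
toy <- matrix(NA_real_, nrow = 6, ncol = 4,
              dimnames = list(paste(rep(c("H_1", "H_2"), each = 3),
                                    paste("S", 1:6, sep = "_"), sep = ":"),
                              paste(rep(c("H_1", "H_2"), times = 2),
                                    rep(c("D_1", "D_2"), each = 2), sep = ":")))
toy[1:3, c(1, 3)] <- runif(6, 0, 100)  # H_1 subjects, observed in H_1 columns
toy[4:6, c(2, 4)] <- runif(6, 0, 100)  # H_2 subjects, observed in H_2 columns
observed <- !is.na(toy)
filled <- impute_blocks(toy)
```

Because the blocks are defined by the parsed names rather than by fixed index ranges, hospitals with different numbers of subjects need no special handling; blocks with no observed value are left as NA.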

【Question discussion】:

    Tags: r matrix dataframe dplyr reshape2


    【Solution 1】:

    Is this what you want?

    df$hos.sub=paste(df$hospital,df$subject)
    df$hos.dru=paste(df$hospital,df$drug)
    
    ind1 <- list(factor(df$hos.sub),factor(df$hos.dru))
    res<-tapply(df[,"outcome"],ind1,mean)
    head(res[,1:9])
    
    > head(res[,1:9])
               H_1 D_1  H_1 D_10 H_1 D_100 H_1 D_1000 H_1 D_101  H_1 D_102 H_1 D_103 H_1 D_104 H_1 D_105
    H_1 S_1  26.550866 83.189899  6.516364   45.77171  6.471249 26.6257392  81.14044  9.088058  67.64499
    H_1 S_10  6.178627  4.288589 45.675309   77.90078  3.338293 95.5751769  92.02642 49.810641  14.31814
    H_1 S_2  37.212390 76.684275 27.743618   21.32599 67.661240 66.0476814  82.46891 97.271288  88.86986
    H_1 S_3  57.285336 27.278032 60.041069   55.22206 73.537169 21.2416518  91.60083 85.267414  95.01507
    H_1 S_4  90.820779 18.816330 27.314448   13.21052 11.129967  0.5266102  72.34151 49.899330  91.69972
    H_1 S_5  20.168193 22.576183 94.148905   44.60504  4.665462 10.2902506  91.02545 27.440370  90.51900
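
As the output above shows, `tapply` orders rows and columns by the sorted factor levels (`D_1`, `D_10`, `D_100`, ...), not by first appearance in the data. If the original order matters, one option is to pass explicit `levels` to `factor()`. The sketch below illustrates this on hypothetical toy data (`toy.long`, `sorted` and `in.order` are illustrative names only):

```r
# Toy long-format data: 2 hospitals, 2 subjects each, 3 drugs
toy.long <- data.frame(
  hos.sub = rep(c("H_1 S_2", "H_1 S_10", "H_2 S_3", "H_2 S_1"), times = 3),
  hos.dru = paste(rep(c("H_1", "H_1", "H_2", "H_2"), times = 3),
                  rep(c("D_2", "D_10", "D_1"), each = 4)),
  outcome = 1:12,
  stringsAsFactors = FALSE
)

# Default factor(): levels are sorted, so "H_1 S_10" comes before "H_1 S_2"
sorted <- tapply(toy.long$outcome,
                 list(factor(toy.long$hos.sub), factor(toy.long$hos.dru)),
                 mean)

# Explicit levels: rows and columns follow first appearance in the data
in.order <- tapply(toy.long$outcome,
                   list(factor(toy.long$hos.sub,
                               levels = unique(toy.long$hos.sub)),
                        factor(toy.long$hos.dru,
                               levels = unique(toy.long$hos.dru))),
                   mean)
rownames(sorted)
rownames(in.order)
```

Combinations that never occur in the data, such as a hospital 1 subject under a hospital 2 drug, come back as NA in both versions.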
    

    【Discussion】:

    • I don't think this is what I described in the question.