R: add prorated row means and percent missing of a selection as new variables to a df

[Question title]: R: add prorated rowmeans and percent missing of selection as new variables to df
[Posted]: 2020-08-16 01:18:30
[Question]:

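This is a complicated question...

I want to calculate the prorated mean of a selection of variables. I also want to calculate the percentage of missing variables in the same selection. That is, if bhs_1:bhs_4 were 1 2 3 NA, I would want a prorated mean of m = 2 and a percent missing of 0.25. I know that NCOL(x) and length(x) will give me the length of x, but I can't figure out how to wrap all of this up to get my result. I want to bind the result to my df for later analysis.

I have a working solution (the first code block below). However, I need to do this repeatedly, so I am after something more efficient than repeating the same code over and over. In addition, I need to calculate the row mean from different variables depending on the time of administration (protocol is the time variable in the df below). Specifically, I have data from two different protocols: the variables bhs_1:bhs_4 were collected during protocol 1, whereas the variables bhsSF_1:bhsSF_4 were collected during protocol 2.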

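There is one more twist: I have a measure that is partly mandatory and partly optional. Specifically, mssi_1:mssi_4 are mandatory, while mssi_5:mssi_8 are optional and depend on the answers to the former. That is, if a participant reaches a certain score on the former, the latter are administered; otherwise administration stops. The true score for this measure is therefore the mean over the full length of the selection (i.e. 8 variables), not the prorated mean. So the NAs matter: sometimes they are more or less equivalent to zero, but not always, because they can also be genuine NAs! I hope this makes sense...

A tidyverse solution would be preferable, but a base R version is fine too, since I hope to turn this into a function one day; I need to be able to do this on a regular basis.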
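One possible reading of this scoring rule, sketched in base R: add up whatever was answered and divide by the full scale length, so that skipped optional items count as zero. The helper name `score_fixed_length`, the toy values, and the rule that a missing mandatory item makes the whole score `NA` are assumptions made for illustration; the question does not pin down how genuine NAs should be handled.

```r
# Hypothetical helper: sum of answered items divided by the FULL number of items.
score_fixed_length <- function(data, mandatory, optional) {
  all_items <- data[, c(mandatory, optional), drop = FALSE]
  # Skipped items contribute 0 to the sum, and the divisor stays at the scale length.
  score <- rowSums(all_items, na.rm = TRUE) / ncol(all_items)
  # Assumption: if any mandatory item is missing, the score is treated as unknown.
  score[!complete.cases(data[, mandatory, drop = FALSE])] <- NA
  score
}

# Toy data using the question's column names (values are illustrative only).
toy <- data.frame(
  mssi_1 = c(0, 3, NA), mssi_2 = c(0, 2, NA),
  mssi_3 = c(0, 0, NA), mssi_4 = c(0, 0, NA),
  mssi_5 = c(NA, 3, NA), mssi_6 = c(NA, 3, NA),
  mssi_7 = c(NA, 3, NA), mssi_8 = c(NA, 1, NA)
)

toy$mssi_score <- score_fixed_length(
  toy,
  mandatory = paste0("mssi_", 1:4),
  optional  = paste0("mssi_", 5:8)
)
```

With these toy rows, the first participant (optional part never administered) scores 0, the second scores 15 / 8, and the third (nothing answered) stays `NA`.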

df <- df %>%
    select(bhs_1:bhs_4) %>%
    rowMeans(., na.rm = TRUE) %>%
    round(., digits = 2) %>%
    bind_cols(df, bhs_mean = .)   # bind the rounded means back onto the original df

## this works to calculate the number missing from the selected variables
df %>%
    select(bhs_1:bhs_4) %>%
    apply(., MARGIN = 1, function(x) sum(is.na(x)))
## just not sure how to bind this as a new variable based on the number of NAs
## divided by the length of the selection
## I know that NCOL(x) and length(x) will give me the number of columns in the selection, but how
## do I use this to calculate the percentage?
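
A minimal base R sketch of the percentage step, assuming the `bhs_1:bhs_4` columns from the question: `is.na()` on the selection returns a logical matrix, so its row mean is already the number of NAs divided by the number of selected columns. The toy values and the output column names `bhs_pct_miss` and `bhs_mean` are made up for illustration.

```r
# Toy data with the question's column names (values are illustrative only).
df <- data.frame(
  bhs_1 = c(1, 5, NA),
  bhs_2 = c(2, 5, NA),
  bhs_3 = c(3, 6, NA),
  bhs_4 = c(NA, 5, NA)
)

items <- df[, paste0("bhs_", 1:4)]

# Row mean of the logical NA matrix = share of missing items per row.
df$bhs_pct_miss <- rowMeans(is.na(items))

# Prorated mean: average over the items that were actually answered.
df$bhs_mean <- round(rowMeans(items, na.rm = TRUE), 2)

# Rows without any answered item come back as NaN; recode them to NA.
df$bhs_mean[is.nan(df$bhs_mean)] <- NA
```

Assigning with `df$...` keeps the new variables on the original data frame, so no separate bind step is needed.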

A minimal data set:

structure(list(protocol = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, NA
), uci = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, NA), pa_1 = c(NA,
2L, NA, 5L, NA, 2L, NA, 5L, NA), pa_2 = c(NA, 4L, NA, 5L, NA,
4L, NA, 5L, NA), pa_3 = c(NA, 2L, NA, 5L, NA, NA, NA, 5L, NA),
    pa_4 = c(NA, 5L, NA, 5L, NA, 5L, NA, 5L, NA), dass_1 = c(1L,
    1L, 2L, 3L, NA, 1L, 2L, 3L, NA), dass_2 = c(1L, 1L, 2L, 2L,
    1L, 1L, 2L, NA, NA), dass_3 = c(2L, 2L, NA, 3L, 2L, 2L, NA,
    NA, NA), dass_4 = c(1L, 3L, 0L, 3L, 1L, 3L, NA, NA, NA),
    bhsSF_1 = c(NA, 1L, NA, 5L, NA, 1L, NA, 5L, NA), bhsSF_2 = c(NA,
    3L, NA, 6L, NA, 3L, NA, NA, NA), bhsSF_3 = c(NA, 3L, NA,
    6L, NA, 3L, NA, 6L, NA), bhsSF_4 = c(NA, 3L, NA, 5L, NA,
    3L, NA, 5L, NA), bhs_1 = c(5L, NA, 1L, NA, 5L, NA, 5L, NA,
    NA), bhs_2 = c(5L, NA, 1L, NA, 0L, NA, 5L, NA, NA), bhs_3 = c(6L,
    NA, 0L, NA, 1L, NA, 0L, NA, NA), bhs_4 = c(5L, NA, 1L, NA,
    0L, NA, 1L, NA, NA), mssi_1 = c(0L, 0L, 3L, 2L, 0L, 0L, 3L,
    2L, NA), mssi_2 = c(0L, 1L, 2L, 1L, 0L, 1L, 2L, 1L, NA),
    mssi_3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, NA, NA), mssi_4 = c(0L,
    0L, 0L, 0L, 0L, 0L, 0L, 0L, NA), mssi_5 = c(NA, NA, 3L, 2L,
    NA, NA, 3L, 2L, NA), mssi_6 = c(NA, NA, 3L, 2L, NA, NA, 3L,
    2L, NA), mssi_7 = c(NA, NA, 3L, 2L, NA, NA, NA, NA, NA),
    mssi_8 = c(NA, NA, 1L, 1L, NA, NA, 1L, 1L, NA)), class = "data.frame", row.names = c(NA,
-9L))

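Bonus round

As I said, I will be doing this repeatedly, so wrapping it in a function would be ideal. I have never written a function, so if you could show me whether and how this can be done, that would be amazing!

[Comments]:

Tags: r dplyr functional-programming purrr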

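[Solution 1]:

I could not follow your second paragraph; could you describe it with the expected output? For your first query, you can use the following function to calculate the mean and the fill counts: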

library(dplyr)

CalculatorFun1 <- function(df, protocol_Type) {

  # protocol_Type: can take values 1/2, based on your data
  # add a row key so the summary columns can be joined back onto df
  df <- cbind("Row_number" = row.names(df), df)
  df$Row_number <- as.character(df$Row_number)

  # pick the item columns that belong to the requested protocol
  if (protocol_Type == 1) {
    varList <- grep("^bhs_", names(df), value = TRUE)
  } else if (protocol_Type == 2) {
    varList <- grep("^bhsSF_", names(df), value = TRUE)
  } else {
    stop("Enter correct value for protocol_Type")
  }

  # inside mutate(), `.` is the data frame piped in (the selected items only)
  temp <- df %>%
    select(all_of(varList)) %>%
    mutate(Row_number = row.names(df),
           NAcnt = apply(., 1, function(x) sum(is.na(x))),  # missing items per row
           cnt = apply(., 1, function(x) length(x)),        # items in the selection
           Fill_Prop = 1 - (NAcnt / cnt),                   # share of answered items
           Avrg = round(rowMeans(., na.rm = TRUE), 2)       # prorated mean
    ) %>% select(Row_number, NAcnt, cnt, Fill_Prop, Avrg)

  Final_df <- df %>% left_join(temp, by = "Row_number")

  return(Final_df)

}

# call the function for one protocol type
df_out <- CalculatorFun1(df, protocol_Type = 2)
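
The function above scores every row with the item set of a single protocol. If the item set should instead follow each row's own `protocol` value, as the question describes, one way to sketch that in base R is shown below. The helper `add_prorated`, its arguments, and the output column names are hypothetical, and rows with a missing protocol are simply left as `NA`, which is an assumption rather than something the question specifies.

```r
# Hypothetical helper: choose the item columns per row from the protocol column.
# prefix_by_protocol is a named character vector: protocol value -> column prefix.
add_prorated <- function(data, prefix_by_protocol,
                         protocol_col = "protocol", out = "score") {
  avg <- rep(NA_real_, nrow(data))
  pct_miss <- rep(NA_real_, nrow(data))
  for (p in names(prefix_by_protocol)) {
    # which() drops rows whose protocol is NA
    rows <- which(as.character(data[[protocol_col]]) == p)
    cols <- grep(paste0("^", prefix_by_protocol[[p]]), names(data), value = TRUE)
    if (length(rows) == 0 || length(cols) == 0) next
    items <- data[rows, cols, drop = FALSE]
    pct_miss[rows] <- rowMeans(is.na(items))              # share of missing items
    avg[rows] <- round(rowMeans(items, na.rm = TRUE), 2)  # prorated mean
  }
  avg[is.nan(avg)] <- NA  # rows with no answered items
  data[[paste0(out, "_mean")]] <- avg
  data[[paste0(out, "_pct_miss")]] <- pct_miss
  data
}

# Toy data using the question's column names (values are illustrative only).
toy <- data.frame(
  protocol = c(1, 2, NA),
  bhs_1 = c(1, NA, NA), bhs_2 = c(2, NA, NA),
  bhs_3 = c(3, NA, NA), bhs_4 = c(NA, NA, NA),
  bhsSF_1 = c(NA, 5, NA), bhsSF_2 = c(NA, 6, NA),
  bhsSF_3 = c(NA, 6, NA), bhsSF_4 = c(NA, 5, NA)
)

out <- add_prorated(toy, c("1" = "bhs_", "2" = "bhsSF_"), out = "bhs")
```

Because the prefixes are passed in as an argument, the same helper could be reused for other scales by changing the mapping and the `out` name.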

[Comments]: