【Question title】:R - Filling missing values (blanks) based upon values on the same row but different column
【Posted】:2015-07-10 23:32:19
【Question】:

I am using R and have the following example data frame, in which all variables are factors:

  first            second  third
 social     birth control   high
            birth control   high
medical  Anorexia Nervosa    low
medical  Anorexia Nervosa    low
               Alcoholism   high
 family        Alcoholism   high

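Basically, I need a function that fills the blanks in the first column based on the values in the second and third columns. For example, if the second column is "birth control" and the third column is "high", the blank in the first column should be filled with "social". If the second and third columns are "Alcoholism" and "high" respectively, the blank in the first column should be filled with "family".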

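【Comments】:

  • Beyond that... post what you have tried. Please provide code.
  • Do you also have a list of the conditions?
  • You may also find some inspiration in this question (and its answers).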

Tags: r missing-data autofill


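【Solution 1】:

From the data shown, it is not clear whether a given combination of 'second' and 'third' can have more than one value of 'first'. If each combination has only one value and you need to replace the '' with that value, you could try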

library(data.table)
setDT(df1)[, replace(first, first=='', first[first!='']),
                                         list(second, third)]

Or, more efficiently, assign by reference with `:=`

setDT(df1)[, first:= first[first!=''] , list(second, third)]
#     first           second third
#1:  social    birth control  high
#2:  social    birth control  high
#3: medical Anorexia Nervosa   low
#4: medical Anorexia Nervosa   low
#5:  family       Alcoholism  high
#6:  family       Alcoholism  high
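
The assignment above relies on every (second, third) group having exactly one distinct non-blank value of `first`. The following is a minimal sketch, not part of the original answer, of how that assumption could be checked first and how the fill could be made tolerant of groups with several known rows or none at all. The names `dt`, `check`, `n_values` and `known` are illustrative only.

```r
library(data.table)

# Same toy data as in this answer, rebuilt so the sketch is self-contained
dt <- data.table(
  first  = c("social", "", "medical", "medical", "", "family"),
  second = c("birth control", "birth control", "Anorexia Nervosa",
             "Anorexia Nervosa", "Alcoholism", "Alcoholism"),
  third  = c("high", "high", "low", "low", "high", "high")
)

# Count the distinct non-blank values of `first` per (second, third) group;
# any group where n_values is not 1 needs a decision before filling
check <- dt[, .(n_values = uniqueN(first[first != ""])), by = .(second, third)]

# Fill blanks with the first non-blank value of the group, and leave the
# blanks untouched when the group has no non-blank value at all
dt[, first := {
  known <- first[first != ""]
  if (length(known) > 0L) replace(first, first == "", known[1L]) else first
}, by = .(second, third)]
```

Taking `known[1L]` is a design choice for this sketch: it sidesteps recycling problems when a group has several known rows, at the cost of silently picking one value if they disagree, which is why the `check` table is worth inspecting beforehand.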

Data

df1 <- structure(list(first = c("social", "", "medical", "medical", 
"", "family"), second = c("birth control", "birth control", 
"Anorexia Nervosa", 
"Anorexia Nervosa", "Alcoholism", "Alcoholism"), third = c("high", 
"high", "low", "low", "high", "high")), .Names = c("first", "second", 
"third"), class = "data.frame", row.names = c(NA, -6L))

【Comments】:

  • Personally, I find replace() inefficient and less readable than the version that uses :=.
  • Thanks everyone... every solution proved helpful for my problem!
【Solution 2】:

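One approach is to create a lookup list of some sort (for example, using a named vector, a factor, or something similar) and then replace any "" values with the corresponding values from the lookup list.

Here is an example (though I think your problem is not fully defined and may be overly simplified).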

library(dplyr)
library(tidyr)

mydf %>%
  unite(condition, second, third, remove = FALSE) %>%
  mutate(condition = factor(condition, 
                            c("birth control_high", "Anorexia Nervosa_low", "Alcoholism_high"),
                            c("social", "medical", "family"))) %>%
  mutate(condition = as.character(condition)) %>%
  mutate(first = replace(first, first == "", condition[first == ""])) %>%
  select(-condition)
#     first           second third
# 1  social    birth control  high
# 2  social    birth control  high
# 3 medical Anorexia Nervosa   low
# 4 medical Anorexia Nervosa   low
# 5  family       Alcoholism  high
# 6  family       Alcoholism  high

A "data.table" approach follows the same steps, but has the advantage of modifying by reference rather than copying.

library(data.table)
as.data.table(mydf)[
  , condition := sprintf("%s_%s", second, third)][
    , condition := as.character(
      factor(condition, 
             c("birth control_high", "Anorexia Nervosa_low", "Alcoholism_high"),
             c("social", "medical", "family")))][
               first == "", first := condition][
                 , condition := NULL][]
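
The lookup idea described at the start of this answer can also be sketched in base R with a plain named character vector. This is an illustration rather than the answerer's code: `mydf` is assumed to hold the question's data as character columns (factor columns would need `as.character()` first), and `lookup`, `key` and `blank` are hypothetical names.

```r
# Toy data matching the question (assumed to be what `mydf` holds)
mydf <- data.frame(
  first  = c("social", "", "medical", "medical", "", "family"),
  second = c("birth control", "birth control", "Anorexia Nervosa",
             "Anorexia Nervosa", "Alcoholism", "Alcoholism"),
  third  = c("high", "high", "low", "low", "high", "high"),
  stringsAsFactors = FALSE
)

# Named character vector: names are "second_third" keys, values are the fill
lookup <- c("birth control_high"   = "social",
            "Anorexia Nervosa_low" = "medical",
            "Alcoholism_high"      = "family")

# Build the key for every row and flag the rows that need filling
key   <- paste(mydf$second, mydf$third, sep = "_")
blank <- mydf$first == ""

# Index the lookup by name; unname() drops the key names from the result
mydf$first[blank] <- unname(lookup[key[blank]])
```

A key that is missing from `lookup` yields `NA` for that row, which makes unmapped combinations easy to spot afterwards.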

【Comments】:

【Solution 3】:

Another approach with dplyr, based on @akrun's very nice solution

library(dplyr)

df1 %>% 
  group_by(second, third) %>% 
  mutate(first = replace(first, first == '', first[first != ''])) %>% 
  ungroup()
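
As a possible variant of the grouped replace above, the blanks can be converted to `NA` and filled within each group using `tidyr::fill()`. This is a sketch under the assumption that `first` is a character column and that a tidyr version supporting `.direction = "downup"` is installed; `filled` is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Same toy data as in this answer, rebuilt so the sketch is self-contained
df1 <- data.frame(
  first  = c("social", "", "medical", "medical", "", "family"),
  second = c("birth control", "birth control", "Anorexia Nervosa",
             "Anorexia Nervosa", "Alcoholism", "Alcoholism"),
  third  = c("high", "high", "low", "low", "high", "high"),
  stringsAsFactors = FALSE
)

filled <- df1 %>%
  mutate(first = na_if(first, "")) %>%     # treat blanks as missing
  group_by(second, third) %>%
  fill(first, .direction = "downup") %>%   # copy the known value within each group
  ungroup()
```

Unlike the `replace()` version, this leaves an `NA` when a group has no known value, instead of failing or recycling.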
    

Data

df1 <- structure(list(first = c("social", "", "medical", "medical", 
"", "family"), second = c("birth control", "birth control", 
"Anorexia Nervosa", 
"Anorexia Nervosa", "Alcoholism", "Alcoholism"), third = c("high", 
"high", "low", "low", "high", "high")), .Names = c("first", "second", 
"third"), class = "data.frame", row.names = c(NA, -6L))

【Comments】:
