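【Question title】: Detecting rows with multiple observations in R
【Posted】: 2022-01-22 22:49:11
【Question】:

I have a dataset like the one below. I want to identify every observation that has more than one value across the color columns and replace its color with "multicolor".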

ID  color1   color2
23   red      NA
44   blue     purple
51   yellow   NA
59   green    orange

Like this:

ID  color   
23   red      
44   multicolor     
51   yellow     
59   multicolor   

Any ideas would be appreciated, thanks!

【Question comments】:

    Tags: r


    【Solution 1】:

    This seems like a simple solution:

    library(dplyr)
    library(stringr)
    data %>%
      mutate(
        # step 1 - paste `color1` and `color2` together and remove " NA":
        color = gsub("\\sNA", "", paste(color1, color2)),
        # step 2 - count the number of white space characters:
        color = str_count(color, " "),
        # step 3 - label `color` as "multicolor" where `color` != 0:
        color = ifelse(color == 0, color1, "multicolor")) %>%
      # remove the obsolete color columns: 
      select(-matches("\\d$"))
      ID      color
    1 23        red
    2 44 multicolor
    3 51     yellow
    4 59 multicolor
    

    Data:

    data <- data.frame(ID = c(23, 44, 51, 59),
                       color1 = c("red", "blue", "yellow", "green"),
                       color2 = c(NA, "purple", NA, "orange"))
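
One caveat with counting whitespace: a single color whose name contains a space (for example "light blue") would be labeled "multicolor". The sketch below is only an illustration, not part of the original answer; it counts the non-missing values directly instead. The data frame `data2` and the helper column `n_colors` are hypothetical names, and it assumes exactly the two columns `color1` and `color2`.

```r
library(dplyr)

# Hypothetical data: "light blue" contains a space but is a single color
data2 <- data.frame(ID = c(23, 44, 51),
                    color1 = c("red", "blue", "light blue"),
                    color2 = c(NA, "purple", NA),
                    stringsAsFactors = FALSE)

result <- data2 %>%
  mutate(
    # count the non-missing color values in each row
    n_colors = (!is.na(color1)) + (!is.na(color2)),
    # keep color1 for single-color rows, label the rest
    color = ifelse(n_colors > 1, "multicolor", color1)) %>%
  select(ID, color)

result
```

Because nothing is pasted together, the content of the color names no longer matters.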
    

    【Comments】:

      【Solution 2】:

      A base R approach

      # get colors from columns named color*
      colo <- paste(names(table(unlist(df1[,grep("color",colnames(df1))]))), collapse="|")
      
      colo
      [1] "blue|green|orange|purple|red|yellow"
      
      # match the colors and do the conversion
      data.frame( 
        ID=df1$ID, 
        color=apply( df1, 1, function(x){ 
          y=x[grep(colo, x)];
          if(length(y)>1){y="multicolor"}; y } ) )
        ID      color
      1 23        red
      2 44 multicolor
      3 51     yellow
      4 59 multicolor
      

      Data

      df1 <- structure(list(ID = c(23L, 44L, 51L, 59L), color1 = c("red", 
      "blue", "yellow", "green"), color2 = c(NA, "purple", NA, "orange"
      )), class = "data.frame", row.names = c(NA, -4L))
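
If the regex pattern is not needed, a vectorized base R variant may be simpler. This is a sketch only, not the answer's method: it counts non-`NA` entries per row with `rowSums()` and assumes that a single-color row always stores its color in `color1`. The names `color_cols`, `n_colors` and `out` are made up for illustration.

```r
df1 <- data.frame(ID = c(23L, 44L, 51L, 59L),
                  color1 = c("red", "blue", "yellow", "green"),
                  color2 = c(NA, "purple", NA, "orange"),
                  stringsAsFactors = FALSE)

# all columns whose name starts with "color"
color_cols <- grep("^color", names(df1), value = TRUE)

# number of non-missing colors in each row
n_colors <- unname(rowSums(!is.na(df1[color_cols])))

# assumption: single-color rows keep their color in color1
out <- data.frame(ID = df1$ID,
                  color = ifelse(n_colors > 1, "multicolor", df1$color1),
                  stringsAsFactors = FALSE)
out
```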
      

      【Comments】:

        【Solution 3】:

        Assuming `data` is your dataset, you can do this.

        library(dplyr)
        
        data <- data.frame(ID = c(23, 44, 51, 59),
                           color1 = c("red", "blue", "yellow", "green"),
                           color2 = c(NA, "purple", NA, "orange"))
        
        data %>% 
          mutate(color = ifelse(is.na(color2), color1, "multicolor")) %>% 
          select(ID, color)
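
The code above relies on `color2` being the only column that can be missing. If the single color could also sit in `color2` while `color1` is `NA`, something along these lines might work. It is a hedged sketch on hypothetical data (`data3`), using `dplyr::coalesce()` to pick the first non-missing value.

```r
library(dplyr)

# Hypothetical data: row 3 has its only color in color2
data3 <- data.frame(ID = c(23, 44, 51),
                    color1 = c("red", "blue", NA),
                    color2 = c(NA, "purple", "yellow"),
                    stringsAsFactors = FALSE)

res <- data3 %>%
  mutate(color = if_else(!is.na(color1) & !is.na(color2),
                         "multicolor",               # both present
                         coalesce(color1, color2))   # first non-missing value
         ) %>%
  select(ID, color)

res
```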
        

        【Comments】:

          【Solution 4】:

          Here is one approach using the tidyverse.

          library(dplyr)
          library(tidyr)
          
          df %>% 
            pivot_longer(cols = starts_with("color"), values_to = "color", values_drop_na  = TRUE) %>% 
            group_by(ID) %>% 
            summarize(n = n(),
                      color = toString(color), .groups = "drop") %>% 
            mutate(color = if_else(n > 1, "multicolor", color)) %>% 
            select(-n)
          
          # # A tibble: 4 x 2
          #      ID color     
          #   <int> <chr>     
          # 1    23 red       
          # 2    44 multicolor
          # 3    51 yellow    
          # 4    59 multicolor
          

          I did it this way on purpose. Note that if you stop after the summarize() line, you get the actual colors.

          # # A tibble: 4 x 3
          #      ID     n color        
          #   <int> <int> <chr>        
          # 1    23     1 red          
          # 2    44     2 blue, purple 
          # 3    51     1 yellow       
          # 4    59     2 green, orange
          

          This scales if you have many color columns rather than just two. Play around with it; there are many ways to tweak something like this.
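
To illustrate the scaling claim, here is the same pipeline run on a hypothetical data frame `df3` with a third column `color3` (not in the question). Treat it as a sketch: because of `values_drop_na = TRUE`, an ID whose color columns are all `NA` would drop out of the result entirely.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with an extra color column
df3 <- data.frame(ID = c(23L, 44L, 51L, 59L),
                  color1 = c("red", "blue", "yellow", "green"),
                  color2 = c(NA, "purple", NA, "orange"),
                  color3 = c(NA, "pink", NA, NA),
                  stringsAsFactors = FALSE)

res3 <- df3 %>%
  # one row per (ID, non-missing color)
  pivot_longer(cols = starts_with("color"), values_to = "color",
               values_drop_na = TRUE) %>%
  group_by(ID) %>%
  summarize(n = n(), color = toString(color), .groups = "drop") %>%
  # IDs with more than one color get the label
  mutate(color = if_else(n > 1, "multicolor", color)) %>%
  select(-n)

res3
```

No column names are hard-coded beyond the `starts_with("color")` prefix, so adding `color4`, `color5`, and so on should not require changes.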


          Data

          df <- read.table(textConnection("ID  color1   color2
          23   red      NA
          44   blue     purple
          51   yellow   NA
          59   green    orange"), header = TRUE)
          

          【Comments】:
