【Question title】: Conditionally remove rows based on another column in R
【Posted】: 2021-08-13 04:05:07
【Question description】:

I have the following df:

city_code;  name;   job;
489         Jonh    Engineer 
489         Adam    Economist     
128         Mary    Entrepreneur  
128         Matt    Physician    
147         Rob     Entrepreneur
147         Gomez   Retired
199         Thomas  Entrepeneuer
199         Ryan    Entrepeneuer

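My df has thousands of rows. Each city has two different names. I want to select every city in which exactly one person has Entrepreneur in the job column. The df should look like this: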

city_code;  name;  job;
128         Mary   Entrepreneur
128         Matt   Physician
147         Rob    Entrepreneur
147         Gomez  Retired

I want to keep the other columns in the df. Thanks for your help.

【Comments】:

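  • Learn how to subset (filter) a data frame [statmethods.net/management/subset.html], then do it twice. First keep all rows where job=='Entrepreneur' and save them as a separate df, then keep all rows whose city_code appears in that last df.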
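
The two-step subsetting described in the comment could look roughly like the base R sketch below. It is not the commenter's code: it assumes the question's data with the job spellings corrected, and it adds a counting step (not mentioned in the comment) so that only cities with exactly one Entrepreneur are kept. The names `entrepreneurs`, `counts`, `keep_codes` and `result` are introduced here for illustration.

```r
# Data from the question; job spellings corrected (assumption).
df <- data.frame(
  city_code = c(489, 489, 128, 128, 147, 147, 199, 199),
  name = c("Jonh", "Adam", "Mary", "Matt", "Rob", "Gomez", "Thomas", "Ryan"),
  job = c("Engineer", "Economist", "Entrepreneur", "Physician",
          "Entrepreneur", "Retired", "Entrepreneur", "Entrepreneur"),
  stringsAsFactors = FALSE
)

# Step 1: keep only the entrepreneur rows as a separate data frame.
entrepreneurs <- df[df$job == "Entrepreneur", ]

# Added step (assumption): count entrepreneurs per city and keep the
# city codes that occur exactly once.
counts <- table(entrepreneurs$city_code)
keep_codes <- as.numeric(names(counts)[counts == 1])

# Step 2: keep every row of the original df whose city_code is in that set.
result <- df[df$city_code %in% keep_codes, ]
print(result)
```

Because the second step filters the original df rather than `entrepreneurs`, all other columns and the non-entrepreneur rows of the selected cities are retained.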

Tags: r text filter statistics economics


【Solution 1】:

We can use

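library(dplyr)
df %>% 
   # evaluate the condition separately within each city
   group_by(city_code) %>% 
   # keep a city's rows only if it has exactly one 'Entrepreneur'
   filter(sum(job == 'Entrepreneur') == 1) %>%
   ungroup()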

-output

# A tibble: 4 x 3
  city_code name  job         
      <dbl> <chr> <chr>       
1       128 Mary  Entrepreneur
2       128 Matt  Physician   
3       147 Rob   Entrepreneur
4       147 Gomez Retired     
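
To see why the grouped filter works, it can help to look at the per-city count that is compared against 1. The following is a minimal base R sketch of the same idea, not part of the original answer; it assumes the question's data with the job spellings corrected, and the names `per_city`, `n_entrepreneurs` and `result` are introduced here for illustration.

```r
# Data from the question; job spellings corrected (assumption).
df <- data.frame(
  city_code = c(489, 489, 128, 128, 147, 147, 199, 199),
  name = c("Jonh", "Adam", "Mary", "Matt", "Rob", "Gomez", "Thomas", "Ryan"),
  job = c("Engineer", "Economist", "Entrepreneur", "Physician",
          "Entrepreneur", "Retired", "Entrepreneur", "Entrepreneur"),
  stringsAsFactors = FALSE
)

# One total per city: the value that sum(job == 'Entrepreneur')
# takes inside each group.
per_city <- tapply(df$job == "Entrepreneur", df$city_code, sum)
print(per_city)

# ave() repeats the group total on every row of the group,
# so the result can be used directly as a row filter.
n_entrepreneurs <- ave(as.integer(df$job == "Entrepreneur"),
                       df$city_code, FUN = sum)
result <- df[n_entrepreneurs == 1, ]
print(result)
```

Comparing logical values with `sum()` works because `TRUE` counts as 1 and `FALSE` as 0, so the sum is the number of matching rows in the group.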

【Discussion】:

    【Solution 2】:

    Please consider using the reprex package next time, and try to avoid typos in your reproducible dataset. Here is a possible solution:

    library(tidyverse)
    df <- 
      tibble::tribble(
        ~ city_code, ~ name, ~ job,
        489, "Jonh"  , "Engineer",
        489, "Adam"  , "Economist",
        128, "Mary"  , "Entrepreneur",
        128, "Matt"  , "Physician",
        147, "Rob"   , "Entrepreneur",
        147, "Gomez" , "Retired",
        199, "Thomas", "Entrepreneur",
        199, "Ryan"  , "Entrepreneur"
      )
    df %>% 
      group_by(city_code) %>% 
      add_count(job) %>% 
      filter((job == "Entrepreneur" & n == 1) | job != "Entrepreneur")
    #> # A tibble: 6 x 4
    #> # Groups:   city_code [3]
    #>   city_code name  job              n
    #>       <dbl> <chr> <chr>        <int>
    #> 1       489 Jonh  Engineer         1
    #> 2       489 Adam  Economist        1
    #> 3       128 Mary  Entrepreneur     1
    #> 4       128 Matt  Physician        1
    #> 5       147 Rob   Entrepreneur     1
    #> 6       147 Gomez Retired          1
    

    Created on 2021-05-24 by the reprex package (v2.0.0)
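
The output above still includes city 489, which has no Entrepreneur, whereas the expected result in the question contains only cities 128 and 147. A minimal sketch of a variant that keeps the `add_count()` idea but counts entrepreneurs per city might look like this; the helper column name `n_entrepreneurs` and the variable `result` are hypothetical and not from the original answer.

```r
library(dplyr)

# Same data as in the answer above.
df <- tibble::tribble(
  ~ city_code, ~ name, ~ job,
  489, "Jonh"  , "Engineer",
  489, "Adam"  , "Economist",
  128, "Mary"  , "Entrepreneur",
  128, "Matt"  , "Physician",
  147, "Rob"   , "Entrepreneur",
  147, "Gomez" , "Retired",
  199, "Thomas", "Entrepreneur",
  199, "Ryan"  , "Entrepreneur"
)

result <- df %>%
  # Weighted count per city: an entrepreneur row contributes 1, any other row 0.
  add_count(city_code,
            wt = as.integer(job == "Entrepreneur"),
            name = "n_entrepreneurs") %>%
  # Keep only cities with exactly one entrepreneur, then drop the helper column.
  filter(n_entrepreneurs == 1) %>%
  select(-n_entrepreneurs)

print(result)
```

Counting by `city_code` rather than by `job` means a city with zero entrepreneurs gets a count of 0 and is filtered out along with cities that have two or more.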

    【Discussion】:

      【Solution 3】:

      A data.table option

      > library(data.table)
      > setDT(df)[, .SD[sum(job == "Entrepreneur") == 1], city_code]
         city_code  name          job
      1:       128  Mary Entrepreneur
      2:       128  Matt    Physician
      3:       147   Rob Entrepreneur
      4:       147 Gomez      Retired
      

      【Discussion】:
