【Question title】: Consolidating factor categories from one column into a new column
【Posted】: 2017-07-13 18:40:10
【Question】:

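This is my first time posting a question, and I am very new to R. I have searched for an answer but have not found one, so here goes. I have a very large dataset (more than 140K observations) in which one column holds the "program type" category. The possible values are: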

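  • Federal Agency
  • Federal Agency/University
  • National Survey Program
  • LTER
  • University
  • Non-Profit Agency
  • State Agency
  • State Agency/Citizen Monitoring Program
  • State Agency/University/Citizen Monitoring Program
  • Tribal Agency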

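What I want to do is create a new column in which some of these categories are merged. I want:

  • [Federal Agency, Federal Agency/University, National Survey Program] to become Federal Agency/University
  • [LTER, University] to become LTER/University
  • [Non-Profit Agency] to become Non-Profit Agency
  • [State Agency] to become State Agency
  • [State Agency/Citizen Monitoring Program, State Agency/University/Citizen Monitoring Program] to become Citizen Science Monitoring Program
  • [Tribal Agency] to become Tribal Agency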

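Some of these stay the same. I have tried ifelse statements, but they seem to have trouble recognizing what is in the original column and return NA for a large number of observations. I have checked all my spelling, so that is not the cause. Below is what I tried, based on another answer somewhere on this site. My dataset is named TP_state, and the other column is named lagoslakeid. It does not work correctly. Any help would be greatly appreciated!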

x <- c(TP_state$programtype) 
y <- c(TP_state$lagoslakeid)
df <- data.frame(x,y)
DT <- data.table(df)
DT[, Program_Type := ifelse(x %in% c("Federal Agency", "Federal Agency/University", "National Survey Program"), "Federal Agency/University",
                 ifelse(x %in% c("LTER", "University"), "LTER/University",
                 ifelse(x %in% c("Non-Profit Agency"), "Non-Profit Agency",       
                 ifelse(x %in% c("State Agency"), "State Agency",
                 ifelse(x %in% c("State Agency/University/Citizen Monitoring Program", "State Agency/Citizen Monitoring Program"), "Citizen Monitoring Program", 
                 ifelse(x %in% c("Tribal Agency"), "Tribal Agency", NA))))))]  

【Comments on the question】:

  • fct_collapse from the forcats package.
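
As a rough sketch of what this comment points to: fct_collapse takes each new level name as an argument, with the old levels to merge given as a character vector. The `programtype` vector below is made-up sample data containing one value per category from the question, not the asker's TP_state, and `Program_Type` is just the name used for the result. Levels that are not mentioned are left unchanged.

```r
library(forcats)

# Hypothetical sample: one value for each category listed in the question
programtype <- factor(c(
  "Federal Agency", "Federal Agency/University", "National Survey Program",
  "LTER", "University", "Non-Profit Agency", "State Agency",
  "State Agency/Citizen Monitoring Program",
  "State Agency/University/Citizen Monitoring Program",
  "Tribal Agency"
))

# Each argument has the form new_level = c(old levels merged into it)
Program_Type <- fct_collapse(
  programtype,
  "Federal Agency/University" = c("Federal Agency",
                                  "Federal Agency/University",
                                  "National Survey Program"),
  "LTER/University" = c("LTER", "University"),
  "Citizen Science Monitoring Program" = c(
    "State Agency/Citizen Monitoring Program",
    "State Agency/University/Citizen Monitoring Program"
  )
)

# Compare the original and merged values side by side
print(data.frame(programtype, Program_Type))
```

On the real data the same call could be applied to TP_state$programtype and the result assigned to a new column.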

Tags: r


【Answer 1】:

I would try something like this. Let me know if it works for you!

# Pre-fill with NA so that values matching no pattern stay NA
n <- length(df$column_with_factors)
new_type <- rep(NA_character_, n)

for(i in seq_len(n)){
  x <- df$column_with_factors[i]
  if(grepl(pattern = 'federal agency|national survey program', x = x, ignore.case = TRUE)){
    new_type[i] <- 'Federal Agency/University'
  } else if(grepl(pattern = '^lter$|^university$', x = x, ignore.case = TRUE)){
    new_type[i] <- 'LTER/University'
  } else if(grepl(pattern = 'non-profit agency', x = x, ignore.case = TRUE)){
    new_type[i] <- 'Non-Profit Agency'
  } else if(grepl(pattern = '^state agency$', x = x, ignore.case = TRUE)){
    new_type[i] <- 'State Agency'
  } else if(grepl(pattern = 'state agency/(citizen monitoring program|university/citizen monitoring program)', x = x, ignore.case = TRUE)){
    new_type[i] <- 'Citizen Science Monitoring Program'
  } else if(grepl(pattern = 'tribal agency', x = x, ignore.case = TRUE)){
    new_type[i] <- 'Tribal Agency'
  }
}

# Store the result in a new column, leaving the original column untouched
df$Program_Type <- as.factor(new_type)

The same logic written with sapply is more compact and may run faster:

# Each branch returns the merged category for one value;
# the result goes into a new column instead of overwriting the original
df$Program_Type <- sapply(df$column_with_factors, function(x){
  if(grepl(pattern = 'federal agency|national survey program', x = x, ignore.case = TRUE)){
    'Federal Agency/University'
  } else if(grepl(pattern = '^lter$|^university$', x = x, ignore.case = TRUE)){
    'LTER/University'
  } else if(grepl(pattern = 'non-profit agency', x = x, ignore.case = TRUE)){
    'Non-Profit Agency'
  } else if(grepl(pattern = '^state agency$', x = x, ignore.case = TRUE)){
    'State Agency'
  } else if(grepl(pattern = 'state agency/(citizen monitoring program|university/citizen monitoring program)', x = x, ignore.case = TRUE)){
    'Citizen Science Monitoring Program'
  } else if(grepl(pattern = 'tribal agency', x = x, ignore.case = TRUE)){
    'Tribal Agency'
  } else NA_character_
}, USE.NAMES = FALSE)

df$Program_Type <- as.factor(df$Program_Type)
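
With around 140K rows, calling grepl once per element may be more work than needed. A vectorized alternative, sketched here on the assumption that the values match the category names exactly (the asker says the spelling was checked), is a named lookup vector indexed by the column converted to character. The names `lookup` and `Program_Type` and the four-row sample `df` are illustrative, not part of the original answer.

```r
# Names are the original categories, values are the merged categories
lookup <- c(
  "Federal Agency"                                     = "Federal Agency/University",
  "Federal Agency/University"                          = "Federal Agency/University",
  "National Survey Program"                            = "Federal Agency/University",
  "LTER"                                               = "LTER/University",
  "University"                                         = "LTER/University",
  "Non-Profit Agency"                                  = "Non-Profit Agency",
  "State Agency"                                       = "State Agency",
  "State Agency/Citizen Monitoring Program"            = "Citizen Science Monitoring Program",
  "State Agency/University/Citizen Monitoring Program" = "Citizen Science Monitoring Program",
  "Tribal Agency"                                      = "Tribal Agency"
)

# Hypothetical sample data; "Unknown" is deliberately not in the lookup
df <- data.frame(
  column_with_factors = factor(c("LTER", "State Agency", "Tribal Agency", "Unknown"))
)

# as.character() matters here: indexing by a factor would use its integer codes.
# Values missing from the lookup come back as NA.
df$Program_Type <- factor(unname(lookup[as.character(df$column_with_factors)]))
print(df)
```

Unlike the regular expressions above, this matching is case-sensitive and exact, so any stray whitespace or capitalization differences in the data would show up as NA.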

【Discussion】:

    【Answer 2】:

    The forcats package is well suited to recoding tasks like this.

    First create some demo data...

    library(tidyverse)
    library(forcats)
    
    df <-
      tibble(
        programtype = c(
          "Federal Agency",
          "Federal Agency",
          "Federal Agency",
          "State Agency/University/Citizen Monitoring Program",
          "State Agency/University/Citizen Monitoring Program",
          "Federal Agency/University",
          "National Survey Program",
          "LTER",
          "University",
          "Non-Profit Agency",
          "Non-Profit Agency",
          "Non-Profit Agency",
          "Non-Profit Agency",
          "Non-Profit Agency",
          "State Agency",
          "State Agency",
          "State Agency/Citizen Monitoring Program",
          "State Agency/University/Citizen Monitoring Program",
          "Tribal Agency",
          "Tribal Agency",
          "Tribal Agency"
        ),
        ID = 1:21
      )
    

    Then use fct_recode to replace the values. Levels that are not listed keep their original names...

    df %>%
      mutate(
        new_categories = fct_recode(
          programtype,
          "Federal Agency/University" = "Federal Agency",
          "Federal Agency/University" = "Federal Agency/University",
          "Federal Agency/University" = "National Survey Program",
          "LTER/University" = "LTER",
          "LTER/University" = "University",
          "Citizen Science Monitoring Program" = "State Agency/Citizen Monitoring Program",
          "Citizen Science Monitoring Program" = "State Agency/University/Citizen Monitoring Program"
        )
      )
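
Since the question mentions unexpected NA values, it can help to check a recode like the one above by listing each distinct old-to-new pairing. The following is a minimal sketch on a made-up four-value vector; `old`, `new`, `mapping`, and `unmapped` are illustrative names, and on the real data `old` would be the programtype column.

```r
library(forcats)

# Hypothetical sample values
old <- c("LTER", "University", "State Agency", "LTER")

# fct_recode accepts a character vector and converts it to a factor
new <- fct_recode(old,
                  "LTER/University" = "LTER",
                  "LTER/University" = "University")

# One row per distinct old -> new pairing
mapping <- unique(data.frame(old = old, new = as.character(new)))
print(mapping)

# Original values that ended up as NA after recoding (none expected here)
unmapped <- unique(old[is.na(new)])
print(unmapped)
```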
    

    【Discussion】:
