Renaming multiple columns using regexp: answers

【Question title】: Renaming multiple columns using regexp
【Posted】: 2021-11-04 04:37:21
【Question description】:

Problem:

I want to rename a large number of columns by replacing certain recurring strings in their names.

Reprex：

library(dplyr)
library(stringr)

code <- c(round(runif(26, 0, 100),0))
names <- letters
AIYN <- stringi::stri_rand_strings(26, 2)
SIYN <- stringi::stri_rand_strings(26, 2)


df <- bind_cols(code, names, AIYN, SIYN)
colnames(df) <- c("code (2021)", "names (2021)", "all the info you need (AIYN) from A to Z", 
                  "some info you need (SIYN) from A to Z")
View(df)

Attempted solution

colnames(df) <- str_replace_all(colnames(df), "[(2021)]", "")
colnames(df) <- str_replace_all(colnames(df), "all the info you need (AIYN) from A to Z", "AIYN")
colnames(df) <- str_replace_all(colnames(df), "some info you need (SIYN) from A to Z", "SIYN")

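Goal

I want to remove the parentheses that contain numbers (e.g. "(2019)"), and where the parentheses contain only letters (e.g. "(AIYN)", "(SIYN)"), keep just the letters inside them. My solution is verbose, because my data frame has more than a hundred columns.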

【Comments on the question】:

I wonder whether colnames(df) <- coalesce(str_extract(colnames(df), '(?<=\\()[A-Za-z]+(?=\\))'), str_replace_all(colnames(df), "\\s*\\(\\d+\\)", "")) works for you.

Tags: r regex rename

【Solution 1】:

To remove the parentheses that contain digits, you can use any of the following:

stringr::str_replace_all(colnames(df), "\\s*\\(\\d+\\)", "")
stringr::str_remove_all(colnames(df), "\\s*\\(\\d+\\)")
gsub("\\s*\\(\\d+\\)", "", colnames(df))
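The escaped `\\(` and `\\)` matter here. As a small sketch (the sample vector `x` is made up for illustration), compare the pattern above with a bracket expression such as "[(2021)]", which is a character class and deletes each of the characters `(`, `2`, `0`, `1`, `)` wherever they occur:

```r
# Hypothetical sample names, not taken from the question's data frame
x <- c("code (2021)", "id 10 (x)")

# Character class: removes every single "(", "2", "0", "1" or ")" it finds
char_class <- gsub("[(2021)]", "", x)
char_class
# => "code "  "id  x"

# Escaped parentheses: removes only a whole "(digits)" group plus the spaces before it
escaped <- gsub("\\s*\\(\\d+\\)", "", x)
escaped
# => "code"  "id 10 (x)"
```
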

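If the number inside the parentheses must consist of exactly 4 digits, replace \d+ with \d{4} (written with a doubled backslash inside an R string, as in the code above).

Wrap the code above in trimws(...) to remove leading/trailing whitespace.

See the regex demo.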
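A minimal sketch of the four-digit variant combined with trimws(), using made-up names (the vector `x` is an assumption, not the asker's data):

```r
# Hypothetical sample names
x <- c("code (2021)", "(2021) names", "batch (12345)")

# \\d{4} requires exactly four digits between the parentheses
strict <- gsub("\\s*\\(\\d{4}\\)", "", x)
strict
# => "code"  " names"  "batch (12345)"

# trimws() drops the leading space left behind in the second name
cleaned <- trimws(strict)
cleaned
# => "code"  "names"  "batch (12345)"
```

The five-digit group is left alone because the closing parenthesis does not follow the fourth digit.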

To extract the first letters-only value found inside parentheses, you need

stringr::str_extract(colnames(df), '(?<=\\()[A-Za-z]+(?=\\))') # ASCII only
stringr::str_extract(colnames(df), '(?<=\\()\\p{L}+(?=\\))')   # Any Unicode
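To illustrate the difference between the two patterns, here is a small sketch on invented names (the non-ASCII letters are written as \u escapes so the snippet does not depend on file encoding). str_extract() returns only the first match per string and NA where nothing matches:

```r
library(stringr)

# Hypothetical names: digits only, two letter groups, and a non-ASCII letter group
x <- c("code (2021)", "all (AIYN) and (SIYN)", "gr\u00f6\u00dfe (\u00c4\u00d6)")

ascii_only <- str_extract(x, "(?<=\\()[A-Za-z]+(?=\\))")
ascii_only
# => NA  "AIYN"  NA

any_letter <- str_extract(x, "(?<=\\()\\p{L}+(?=\\))")
any_letter
# => NA  "AIYN"  "ÄÖ"
```
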

Combining the two:

colnames(df) <- coalesce(str_extract(colnames(df), '(?<=\\()[A-Za-z]+(?=\\))'), str_replace_all(colnames(df), "\\s*\\(\\d+\\)", ""))

R test

library(dplyr)
library(stringr)

x <-  c("code (2021)", "names (2021)", "all the info you need (AIYN) from A to Z", 
        "some info you need (SIYN) from A to Z")

z <- str_replace_all(x, "\\s*\\(\\d+\\)", "")
# => [1] "code"  "names"  "all the info you need (AIYN) from A to Z"
# => [4] "some info you need (SIYN) from A to Z"
y <- str_extract(z, '(?<=\\()[A-Za-z]+(?=\\))')
# => [1] NA     NA     "AIYN" "SIYN"
coalesce(y, z)
# => "code"  "names" "AIYN"  "SIYN"
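Since the data frame has over a hundred columns, the steps tested above could be wrapped in one function and applied to all names at once. This is only a sketch: `shorten_names` is a hypothetical helper name and the two-column data frame is invented for the example.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: drop "(digits)" groups, then prefer a "(LETTERS)" code if one exists
shorten_names <- function(x) {
  stripped <- str_remove_all(x, "\\s*\\(\\d+\\)")
  code <- str_extract(stripped, "(?<=\\()[A-Za-z]+(?=\\))")
  coalesce(code, stripped)  # falls back to the stripped name where code is NA
}

# Small made-up data frame; check.names = FALSE keeps the awkward names intact
df <- data.frame(
  `code (2021)` = c(1, 96),
  `all the info you need (AIYN) from A to Z` = c("1A", "Dq"),
  check.names = FALSE
)

renamed <- rename_with(df, shorten_names)
names(renamed)
# => "code"  "AIYN"
```
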

【Comments】:

【Solution 2】:

You can try:

library(magrittr)

names(df) <- sub('\\s\\(\\d+\\)', '', names(df)) %>%
                sub('.*\\(([A-Z]+)\\).*', '\\1', .)
names(df)
#[1] "code"  "names" "AIYN"  "SIYN"

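The first sub removes the digits in parentheses, together with the single whitespace character before them.

The second sub extracts the run of one or more [A-Z] characters found inside parentheses.

To use it with dplyr and the pipe: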
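A few edge cases of these two patterns, sketched on invented names (none of them come from the question): `\\s` needs exactly one whitespace character, sub replaces only the first match, and `[A-Z]` ignores lowercase letters.

```r
# Hypothetical names chosen to probe the patterns
x <- c("code(2021)", "a (1) b (2)", "info (aiyn) here", "info (AIYN) here")

# \\s requires one whitespace character, and sub() stops after the first match
first_only <- sub("\\s\\(\\d+\\)", "", x)
first_only
# => "code(2021)"  "a b (2)"  "info (aiyn) here"  "info (AIYN) here"

# gsub() with \\s* handles a missing space and repeated groups
all_groups <- gsub("\\s*\\(\\d+\\)", "", x)
all_groups
# => "code"  "a b"  "info (aiyn) here"  "info (AIYN) here"

# [A-Z] keeps only uppercase codes; names without a match are returned unchanged
upper_code <- sub(".*\\(([A-Z]+)\\).*", "\\1", x)
upper_code
# => "code(2021)"  "a (1) b (2)"  "info (aiyn) here"  "AIYN"

# Widening the class to [A-Za-z] also picks up the lowercase code
any_case <- sub(".*\\(([A-Za-z]+)\\).*", "\\1", x[3])
any_case
# => "aiyn"
```
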

library(dplyr)
df %>% 
    rename_with(~sub('\\s\\(\\d+\\)', '', .) %>% 
                 sub('.*\\(([A-Z]+)\\).*', '\\1', .))

#    code names AIYN  SIYN 
#   <dbl> <chr> <chr> <chr>
# 1     1 a     1A    NR   
# 2    96 b     Dq    hi   
# 3    46 c     28    AQ   
# 4    78 d     Y8    xH   
# 5    76 e     ps    ES   
# 6    56 f     m5    gQ   
# 7    51 g     vV    8u   
# 8    72 h     Hw    JV   
# 9    24 i     0T    7A   
#10    76 j     mq    Qy   
# … with 16 more rows

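【Comments】:

Thank you very much. Do you know how I can use it with dplyr and %>%?
This is more for my own understanding, but (1) why do we have to put a tilde (~) before sub, and (2) why do we use the pipe (%>%) inside the rename_with function?
1) The tilde is the formula shorthand that dplyr accepts for writing an anonymous function to apply. 2) We can do this with or without the pipe. The pipe is generally used to pass the output of one function as the input to another. See, for example, c(2, 4, 5) %>% sum %>% sqrt.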
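Both points can be checked with a short sketch (the two-column data frame and the variable names are invented for illustration): the formula shorthand gives the same result as an ordinary function(nm), and the piped call is equivalent to nesting the functions.

```r
library(dplyr)

# Small made-up data frame; check.names = FALSE keeps the original-style names
df <- data.frame(
  `code (2021)` = c(1, 96),
  `some info you need (SIYN) from A to Z` = c("NR", "hi"),
  check.names = FALSE
)

# Formula shorthand: .x stands for the vector of column names
with_formula <- df %>%
  rename_with(~ sub(".*\\(([A-Z]+)\\).*", "\\1", sub("\\s\\(\\d+\\)", "", .x)))

# The same renaming with an explicit anonymous function and no pipe
with_function <- rename_with(
  df,
  function(nm) sub(".*\\(([A-Z]+)\\).*", "\\1", sub("\\s\\(\\d+\\)", "", nm))
)

names(with_formula)
# => "code"  "SIYN"
identical(names(with_formula), names(with_function))
# => TRUE

# The pipe passes the left-hand result on as input, so these two are the same
piped <- c(2, 4, 5) %>% sum %>% sqrt
nested <- sqrt(sum(c(2, 4, 5)))
```
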