R - Avoid concatenation when replacing string by number

【Question title】: R - Avoid concatenation when replacing string by number
【Posted】: 2021-04-20 13:01:43
【Question】:

This looks like a very simple problem, but so far I have not found a solution.

Consider the following data frame:

dat <- data.frame(id=LETTERS[1:5],
                  land.use=c(3,4,9,34,39))

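I need to replace the numbers in the land.use column with strings. The problem is that I have different strings for the numbers 3, 4 and 34.

However, R insists on replacing 34 with the concatenation of the strings for 3 and 4.

For example: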

dat$land.use <- gsub("3","Bare soil", dat$land.use)
dat$land.use <- gsub("4","Primary Forest", dat$land.use)
dat$land.use <- gsub("9","Secondary Forest", dat$land.use)
dat$land.use <- gsub("34","Wheat", dat$land.use)
dat$land.use <- gsub("39","Soybean", dat$land.use)

> dat
  id                  land.use
1  A                 Bare soil # This is OK
2  B            Primary Forest # This is OK
3  C          Secondary Forest # This is OK
4  D   Bare soilPrimary Forest # This should be Wheat
5  E Bare soilSecondary Forest # This should be Soybean

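What am I doing wrong?

【Question discussion】:

Tags: r replace gsub

【Solution 1】:

When you want an exact match, do not use pattern-matching functions (gsub, grep, etc.), because they also match substrings. Instead, you can create a lookup table and join it to the data.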

lookup_table <- data.frame(land.use = c(3, 4, 9, 34, 39), 
                           value = c("Bare soil", "Primary Forest", 
                           "Secondary Forest", "Wheat", "Soybean"))

merge(dat, lookup_table, all.x = TRUE, by = 'land.use')

#  land.use id            value
#1        3  A        Bare soil
#2        4  B   Primary Forest
#3        9  C Secondary Forest
#4       34  D            Wheat
#5       39  E          Soybean
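
Note that merge() sorts the result by the key column by default, so the row order of dat can change. If the original order matters, a named character vector can act as the lookup table instead. The following is only a minimal sketch and not part of the original answer; the names land_labels and land.use.label are made up for illustration.

```r
dat <- data.frame(id = LETTERS[1:5],
                  land.use = c(3, 4, 9, 34, 39))

# Hypothetical lookup vector: names are the codes (as strings), values are the labels
land_labels <- c("3" = "Bare soil", "4" = "Primary Forest",
                 "9" = "Secondary Forest", "34" = "Wheat", "39" = "Soybean")

# Indexing by name is an exact match, so "34" never matches "3" or "4".
# Row order of dat is left untouched.
dat$land.use.label <- unname(land_labels[as.character(dat$land.use)])
dat
```

Because nothing is joined or sorted here, this variant is probably the easiest to drop into an existing pipeline when only one label column is needed.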

【Discussion】:

【Solution 2】:

In this case I would use match to replace the numbers with strings.

c("Bare soil","Primary Forest","Secondary Forest","Wheat",
  "Soybean")[match(dat$land.use, c(3,4,9,34,39))]
#[1] "Bare soil"        "Primary Forest"   "Secondary Forest" "Wheat"           
#[5] "Soybean"
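
Since match() compares whole values, a code that has no entry in the table yields NA rather than a partial replacement. A small sketch of that behaviour, assuming a made-up code 7 that has no label (the variable names are illustrative only):

```r
codes      <- c(3, 4, 9, 34, 39)
code_names <- c("Bare soil", "Primary Forest", "Secondary Forest",
                "Wheat", "Soybean")

# 7 is a hypothetical land-use code that is missing from the table
x <- c(34, 3, 7)

# match() returns the position of each x in codes, or NA if absent
result <- code_names[match(x, codes)]
result
#[1] "Wheat"     "Bare soil" NA
```

The NA makes unmapped codes easy to spot afterwards, for example with anyNA(result).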

To make your approach work, you have to anchor the patterns with ^ and $ so that they match the whole string.

dat$land.use <- sub("^3$","Bare soil", dat$land.use)
dat$land.use <- sub("^4$","Primary Forest", dat$land.use)
dat$land.use <- sub("^9$","Secondary Forest", dat$land.use)
dat$land.use <- sub("^34$","Wheat", dat$land.use)
dat$land.use <- sub("^39$","Soybean", dat$land.use)
dat
#  id         land.use
#1  A        Bare soil
#2  B   Primary Forest
#3  C Secondary Forest
#4  D            Wheat
#5  E          Soybean
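
Why the anchors matter can be seen by stepping through the unanchored calls on just two values. This is an illustrative sketch; the variable names x, step1, step2 and anchored are not from the original post.

```r
x <- c("3", "34")

# Unanchored: "3" also matches inside "34", leaving a stray "4" behind
step1 <- gsub("3", "Bare soil", x)
step1
#[1] "Bare soil"  "Bare soil4"

# The leftover "4" is then picked up by the next replacement
step2 <- gsub("4", "Primary Forest", step1)
step2
#[1] "Bare soil"               "Bare soilPrimary Forest"

# Anchored: ^ and $ force the pattern to cover the entire string,
# so "34" is left alone until its own pattern is applied
anchored <- sub("^3$", "Bare soil", x)
anchored
#[1] "Bare soil" "34"
```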

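【Discussion】:

【Solution 3】:

Depending on what you want to do next, you may also want a factor() variable. You can create it directly as shown here, or use one of the other methods and call as.factor() afterwards.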

dat$land.use.factor <- factor(dat$land.use, 
                              levels = c(3, 4, 9, 34, 39),
                              labels = c("Bare soil", "Primary Forest", 
                                         "Secondary Forest", "Wheat", "Soybean"))

# > dat
#    id land.use  land.use.factor
# 1   A        3        Bare soil
# 2   B        4   Primary Forest
# 3   C        9 Secondary Forest
# 4   D       34            Wheat
# 5   E       39          Soybean
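
One thing to keep in mind with the factor approach: the original numeric codes are not stored in the factor, only the position of each value within levels. A short sketch, using a hypothetical three-element vector rather than the full data frame:

```r
f <- factor(c(3, 34, 4),
            levels = c(3, 4, 9, 34, 39),
            labels = c("Bare soil", "Primary Forest",
                       "Secondary Forest", "Wheat", "Soybean"))

# The labels are recovered with as.character()
as.character(f)
#[1] "Bare soil"      "Wheat"          "Primary Forest"

# as.integer() gives level positions (1, 4, 2), not the codes 3, 34, 4
as.integer(f)
#[1] 1 4 2
```

This is presumably why the answer stores the factor in a new column and keeps land.use as it is.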

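【Discussion】:

【Solution 4】:

We can use left_join from dplyr, with dat from the question and the lookup table keydat defined under "Data" below.

library(dplyr)
left_join(dat, keydat, by = 'land.use')

Data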

keydat <- data.frame(land.use = c(3, 4, 9, 34, 39), 
                           value = c("Bare soil", "Primary Forest", 
                           "Secondary Forest", "Wheat", "Soybean"))

【Discussion】: