R: Convert a factor column to multiple boolean columns

【Question title】: R: Convert factor column to multiple boolean columns
【Posted】: 2026-02-17 13:15:02
【Question】:

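I am trying to convert a factor column into multiple boolean columns. The data comes from a weather station and was retrieved with the excellent weatherData package. The factor column I want to convert has 11 levels. Some of them are single "events", and others are combinations of "events".

Here is R code that generates a data frame with the combined factor levels, which I want to convert into several boolean columns: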

df <- read.table(text="
date    Events
1/8/2013    Rain
1/9/2013    Fog
1/10/2013   ''
1/11/2013   Fog-Rain
1/12/2013   Snow
1/13/2013   Rain-Snow
1/14/2013   Rain-Thunderstorm
1/15/2013   Thunderstorm
1/16/2013   Fog-Rain-Thunderstorm
1/17/2013   Fog-Thunderstorm
1/18/2013   Fog-Rain-Thunderstorm-Snow",
                 header=TRUE)
df$date <- as.character(as.Date(df$date, "%m/%d/%Y"))

Thanks in advance.

【Question comments】:

Tags: r boolean dataframe

【Solution 1】:

You could try:

 lst <- strsplit(as.character(df$Events),"-")
 lvl <- unique(unlist(lst))      
 res <- data.frame(date=df$date,
            do.call(rbind,lapply(lst, function(x) table(factor(x, levels=lvl)))), 
                                       stringsAsFactors=FALSE)

  res
 #         date Rain Fog Snow Thunderstorm
 #1  2013-01-08    1   0    0            0
 #2  2013-01-09    0   1    0            0
 #3  2013-01-10    0   0    0            0
 #4  2013-01-11    1   1    0            0
 #5  2013-01-12    0   0    1            0
 #6  2013-01-13    1   0    1            0
 #7  2013-01-14    1   0    0            1
 #8  2013-01-15    0   0    0            1
 #9  2013-01-16    1   1    0            1
 #10 2013-01-17    0   1    0            1
 #11 2013-01-18    1   1    1            1

Alternatively, this may be faster than the above (contributed by @alexis_laz):

  setNames(data.frame(df$date, do.call(rbind,lapply(lst, function(x) as.integer(lvl %in% x)) )), c("date", lvl))
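
If real TRUE/FALSE columns are wanted instead of 0/1 integers, the same `%in%` idea can be sketched as follows. This is only a minimal sketch: the small `df` is rebuilt inline from a few rows of the question's data so the block runs on its own, and `flags` is a name introduced here for illustration.

```r
# A few rows of the question's data, rebuilt inline so the sketch is self-contained
df <- data.frame(
  date   = c("2013-01-08", "2013-01-10", "2013-01-11", "2013-01-18"),
  Events = c("Rain", "", "Fog-Rain", "Fog-Rain-Thunderstorm-Snow"),
  stringsAsFactors = TRUE
)

# Split each row's label on "-" and collect the distinct single events
lst <- strsplit(as.character(df$Events), "-", fixed = TRUE)
lvl <- unique(unlist(lst))

# One logical row per observation: is each single event present?
flags <- do.call(rbind, lapply(lst, function(x) lvl %in% x))
colnames(flags) <- lvl

res <- data.frame(date = df$date, flags)
res
```

Skipping `as.integer()` keeps the columns logical, which is closer to the "boolean" columns asked for in the question.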

Or:

 library(devtools)
 library(data.table)
 source_gist("11380733")
 library(reshape2) #In case it is needed 

 res1 <- dcast.data.table(cSplit(df, "Events", "-", "long"), date~Events)
 res2 <- merge(subset(df, select=1), res1, by="date", all=TRUE)
 res2 <- as.data.frame(res2)
 res2[,-1]  <- (!is.na(res2[,-1]))+0
 res2[,c(1,3,2,4,5)]
 #          date Rain Fog Snow Thunderstorm
  #1  2013-01-08    1   0    0            0
  #2  2013-01-09    0   1    0            0
  #3  2013-01-10    0   0    0            0
  #4  2013-01-11    1   1    0            0
  #5  2013-01-12    0   0    1            0
  #6  2013-01-13    1   0    1            0
  #7  2013-01-14    1   0    0            1
  #8  2013-01-15    0   0    0            1
  #9  2013-01-16    1   1    0            1
  #10 2013-01-17    0   1    0            1
  #11 2013-01-18    1   1    1            1

Or:

 library(qdap)
 with(df, termco(Events, date, c("Rain", "Fog", "Snow", "Thunderstorm")))[[1]][,-2]
 #         date Rain Fog Snow Thunderstorm
 #1  2013-01-08    1   0    0            0
 #2  2013-01-09    0   1    0            0
 #3  2013-01-10    0   0    0            0
 #4  2013-01-11    1   1    0            0
 #5  2013-01-12    0   0    1            0
 #6  2013-01-13    1   0    1            0
 #7  2013-01-14    1   0    0            1
 #8  2013-01-15    0   0    0            1
 #9  2013-01-16    1   1    0            1
 #10 2013-01-17    0   1    0            1
 #11 2013-01-18    1   1    1            1

【Comments】:

The second example needs reshape2 for dcast.
@Spacedman. Thanks. I thought dcast from data.table alone would work, since I had both reshape2 and data.table loaded. It seems dcast.data.table is the one that works without reshape2.
@David Arenburg. I tried it without loading reshape2 and it worked for me.
@David Arenburg. Mine is data.table_1.9.2.
@David Arenburg. I tried it again in a fresh console. After library(data.table), sessionInfo() showed: loaded via a namespace (and not attached): [1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.4 stringr_0.6.2

【Solution 2】:

The simplest thing I can think of is concat.split.expanded from my "splitstackshape" package (devel version 1.3.0, from GitHub).

## Get the right version of the package
library(devtools)
install_github("splitstackshape", "mrdwab", ref = "devel")
packageVersion("splitstackshape")
# [1] ‘1.3.0’

## Split up the relevant column
concat.split.expanded(df, "Events", "-", type = "character", 
                      fill = 0, drop = TRUE)
#          date Events_Fog Events_Rain Events_Snow Events_Thunderstorm
# 1  2013-01-08          0           1           0                   0
# 2  2013-01-09          1           0           0                   0
# 3  2013-01-10          0           0           0                   0
# 4  2013-01-11          1           1           0                   0
# 5  2013-01-12          0           0           1                   0
# 6  2013-01-13          0           1           1                   0
# 7  2013-01-14          0           1           0                   1
# 8  2013-01-15          0           0           0                   1
# 9  2013-01-16          1           1           0                   1
# 10 2013-01-17          1           0           0                   1
# 11 2013-01-18          1           1           1                   1

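While answering this question, I realized that I had done some silly hard-coding of the "trim" feature in concat.split.expanded, which can slow things down considerably. If you want a faster approach, use charMat (the function called by concat.split.expanded) directly on the split version of the "Events" column, like this:

splitstackshape:::charMat(
    strsplit(as.character(df[, "Events"]), "-", fixed = TRUE), fill = 0)

For some benchmarks, see this Gist.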

【Comments】:

【Solution 3】:

This can be done in base R using grep:

ddf = data.frame(df$date, df$Events, "Rain"=rep(0), "Fog"=rep(0), "Snow"=rep(0), "Thunderstorm"=rep(0)) 

for(i in 3:6)   ddf[grep(names(ddf)[i],ddf[,2]),i]=1

ddf
      df.date                  df.Events Rain Fog Snow Thunderstorm
1  2013-01-08                       Rain    1   0    0            0
2  2013-01-09                        Fog    0   1    0            0
3  2013-01-10                               0   0    0            0
4  2013-01-11                   Fog-Rain    1   1    0            0
5  2013-01-12                       Snow    0   0    1            0
6  2013-01-13                  Rain-Snow    1   0    1            0
7  2013-01-14          Rain-Thunderstorm    1   0    0            1
8  2013-01-15               Thunderstorm    0   0    0            1
9  2013-01-16      Fog-Rain-Thunderstorm    1   1    0            1
10 2013-01-17           Fog-Thunderstorm    0   1    0            1
11 2013-01-18 Fog-Rain-Thunderstorm-Snow    1   1    1            1
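
Note that `grep` matches substrings, so the loop above relies on no event name being contained in another. A closely related sketch, again assuming the event names are known up front, uses `grepl` to produce logical columns directly. The names `events` and `hits` and the inline `df` are introduced here for illustration; `fixed = TRUE` only makes the names match literally rather than as regular expressions, and it still matches substrings.

```r
# A few rows of the question's data, rebuilt inline so the sketch is self-contained
df <- data.frame(
  date   = c("2013-01-09", "2013-01-10", "2013-01-13", "2013-01-17"),
  Events = c("Fog", "", "Rain-Snow", "Fog-Thunderstorm")
)

# Event names assumed to be known in advance
events <- c("Rain", "Fog", "Snow", "Thunderstorm")

# One logical column per event name; sapply() names the columns after `events`
hits <- sapply(events, function(e) grepl(e, df$Events, fixed = TRUE))

ddf <- cbind(df, hits)
ddf
```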

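【Comments】:

+1. I was just updating my benchmarks, and this is a fast solution (the fastest so far) if you know the expected values in advance.
I have built on this answer to identify the unique options dynamically, so you do not need to know the values in advance.

【Solution 4】:

Here is an approach using qdapTools: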

library(qdapTools)

matrix2df(mtabulate(lapply(split(as.character(df$Events), df$date), 
    function(x) strsplit(x, "-")[[1]])), "Date")

##          Date Fog Rain Snow Thunderstorm
## 1  2013-01-08   0    1    0            0
## 2  2013-01-09   1    0    0            0
## 3  2013-01-10   0    0    0            0
## 4  2013-01-11   1    1    0            0
## 5  2013-01-12   0    0    1            0
## 6  2013-01-13   0    1    1            0
## 7  2013-01-14   0    1    0            1
## 8  2013-01-15   0    0    0            1
## 9  2013-01-16   1    1    0            1
## 10 2013-01-17   1    0    0            1
## 11 2013-01-18   1    1    1            1

Here is the same answer written with magrittr, since it makes the chain clearer:

library(magrittr)  # provides %>%

split(as.character(df$Events), df$date) %>%
    lapply(function(x) strsplit(x, "-")[[1]]) %>%
    mtabulate() %>%
    matrix2df("Date")

【Comments】:

【Solution 5】:

Create a factor vector to test with:

set.seed(1)
n <- c("Rain", "Fog", "Snow", "Thunderstorm")
v <- sapply(sample(0:3,100,T), function(i) paste0(sample(n,i), collapse = "-"))
v <- as.factor(v)

A function that returns a matrix with the desired output, which should then be cbind'ed to the initial data.frame:

mSplit <- function(vec) {
  if (!is.character(vec))
    vec <- as.character(vec)
  L <- strsplit(vec, "-")
  ids <- unlist(lapply(seq_along(L), function(i) rep(i, length(L[[i]])) ))
  U <- sort(unique(unlist(L)))
  M <- matrix(0, nrow = length(vec), 
              ncol = length(U), 
              dimnames = list(NULL, U))
  M[cbind(ids, match(unlist(L), U))] <- 1L
  M
}

The solution is based on Ananda Mahto's answer to this SO question. It should be fast.

res <- mSplit(v)
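
The key step in `mSplit` is the assignment through a two-column index matrix: each row of the index is a (row, column) pair, so all the 1s are written in one vectorised step. The toy example below illustrates just that indexing and the final `cbind`. The vector `v2` and the names `idx` and `out` are made up for this sketch.

```r
v2 <- c("Rain", "", "Fog-Rain")

L <- strsplit(v2, "-", fixed = TRUE)
U <- sort(unique(unlist(L)))              # "Fog" "Rain"

# (row, column) pairs: row = position in v2, column = position in U
idx <- cbind(rep(seq_along(L), lengths(L)), match(unlist(L), U))

M <- matrix(0L, nrow = length(v2), ncol = length(U), dimnames = list(NULL, U))
M[idx] <- 1L                              # set every (row, column) pair at once

# Attach the indicator columns to the original data
out <- cbind(data.frame(Events = v2), M)
out
```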

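【Comments】:

+1. This is essentially what you would see in the code for charMat in "splitstackshape", which I mention in my answer.

【Solution 6】:

I think what you need in this case is a simple call to the function dummy. Let's call the target column target_cat.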

df_target_bin <- data.frame(dummy(target_cat, "<prefix>"))

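This creates a new data frame with one column of 0 and 1 values for each value of target_cat.

To convert the columns to logical, meaning the values are TRUE and FALSE, use the function as.logical.

df_target_logical <- apply(df_target_bin, 2, as.logical)  # 2 = apply over columns; returns a logical matrix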
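
As a small illustration of the 0/1 to TRUE/FALSE step, here is a base R sketch that keeps the result as a data frame rather than a matrix. The toy `bin` data frame stands in for `df_target_bin` and is made up for this example.

```r
# Toy 0/1 indicator columns standing in for df_target_bin
bin <- data.frame(Rain = c(1, 0, 1), Fog = c(0, 1, 1))

# lapply() converts column by column; assigning into bin_lgl[] keeps the
# data frame structure and the column names
bin_lgl <- bin
bin_lgl[] <- lapply(bin, as.logical)

str(bin_lgl)
```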

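【Comments】:

【Solution 7】:

Building on @rnso's answer:

The following identifies all unique elements and then dynamically generates new columns containing the relevant data.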

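library(dplyr)  # rename() and the !!o := newcol syntax come from dplyr

# as.character() so that strsplit() also works when Events is a factor
options = unique(unlist(strsplit(as.character(df$Events), '-'), recursive=FALSE))
for(o in options){
  df$newcol = rep(0)
  df <- rename(df, !!o := newcol)
  df[grep(o, df$Events), o] = 1
}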

Result:

         date                     Events Rain Fog Snow Thunderstorm
1  2013-01-08                       Rain    1   0    0            0
2  2013-01-09                        Fog    0   1    0            0
3  2013-01-10                               0   0    0            0
4  2013-01-11                   Fog-Rain    1   1    0            0
5  2013-01-12                       Snow    0   0    1            0
6  2013-01-13                  Rain-Snow    1   0    1            0
7  2013-01-14          Rain-Thunderstorm    1   0    0            1
8  2013-01-15               Thunderstorm    0   0    0            1
9  2013-01-16      Fog-Rain-Thunderstorm    1   1    0            1
10 2013-01-17           Fog-Thunderstorm    0   1    0            1
11 2013-01-18 Fog-Rain-Thunderstorm-Snow    1   1    1            1

【Comments】:

Nice modification!