One-hot encoding a text string [duplicate]: answers

【Question title】: One-hot encoding a text string [duplicate]
【Posted】: 2021-03-23 05:07:00
【Question】:

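I have a column of mixed strings, and I have created one column for each unique character that appears in those strings. For each row, I need to encode every such column as 1 if its character occurs anywhere in the string, and 0 otherwise.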

library(data.table)
d = data.table(string = c("P_P_F_", "U_F_/", "-_P_B"),
               P = c(1,  0, 1),
               F = c(1, 1, 0),
               U = c(0, 1, 0),
               B = c(0, 0, 1))

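In the example above, string holds the characters that I need to match against the corresponding columns. The first string contains a P and an F, so those columns get a 1 and the rest get a 0. The characters in a string are always separated by underscores, and the maximum length is 7.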

The dataset is fairly large, so I would prefer a data.table solution if possible.

【Comments】:

This should help you: Split string column to create new binary columns

Tags: r data.table

【Solution 1】:

After splitting the strings, we can use mtabulate

library(qdapTools)
cbind(d, mtabulate(strsplit(d$string, "[_/-]")))

Data

d <- data.table(string = c("P_P_F_", "U_F_/", "-_P_B"))

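【Comments】:

Thanks, this is about 10 seconds slower than the answer above, but the syntax is more concise.

【Solution 2】:

Strip the leading and trailing punctuation so that each string is clean, with exactly one separator between characters, then use cSplit_e, which uses data.table internally.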

library(data.table)
d = data.table(string = c("P_P_F_", "U_F_/", "-_P_B"))

d$string <- trimws(d$string, whitespace = '[[:punct:]]')
splitstackshape::cSplit_e(d, 'string', sep = '_', type = 'character', fill = 0)

#   string string_B string_F string_P string_U
#1:  P_P_F        0        1        1        0
#2:    U_F        0        1        0        1
#3:    P_B        1        0        1        0
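The trimming step can be checked on its own before handing the column to cSplit_e. This is a minimal sketch, assuming R 3.6.0 or later (the release in which trimws gained its whitespace argument); the names x and cleaned are illustrative and not part of the answer.

```r
# Illustrative input: the three strings from the question.
x <- c("P_P_F_", "U_F_/", "-_P_B")

# Treat any run of punctuation at either end as "whitespace" and strip it.
# Separators between characters are left untouched.
cleaned <- trimws(x, whitespace = "[[:punct:]]")

print(cleaned)
```

Only the ends of each string are affected, so a string with doubled separators in the middle would still need separate handling.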

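【Comments】:

Thanks, this is nice, but on my dataset I get an error about being unable to allocate a vector of size n. My data has 2.2 million rows in total.

【Solution 3】:

A data.table option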

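# Split each string, drop empty tokens (Filter with nchar), stack the tokens
# with their row index, cross-tabulate, and turn counts > 0 into 0/1.
d[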
  ,
  cbind(
    string,
    as.data.table(
      +(t(
        table(
          stack(
            setNames(
              Map(Filter, list(nchar), strsplit(string, "[_/-]")),
              seq_along(string)
            )
          )
        )
      ) > 0)
    )
  )
]

which gives

   string B F P U
1: P_P_F_ 0 1 1 0
2:  U_F_/ 0 1 0 1
3:  -_P_B 1 0 1 0
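For inputs on the scale of the 2.2 million rows mentioned earlier, one possible alternative is to skip the split-and-tabulate step and test each string directly for each character. The sketch below is not from any of the answers and has not been benchmarked. It assumes that the set of characters (the hypothetical vector chars) is known in advance and that every token is a single character, so a fixed substring test is equivalent to token membership.

```r
library(data.table)

d <- data.table(string = c("P_P_F_", "U_F_/", "-_P_B"))

# Assumption: the characters to encode are known up front.
chars <- c("P", "F", "U", "B")

# Add one integer 0/1 column per character, by reference.
# fixed = TRUE makes grepl do a literal substring search (no regex).
d[, (chars) := lapply(chars, function(ch) {
  as.integer(grepl(ch, string, fixed = TRUE))
})]

print(d)
```

Because := adds the columns by reference and each grepl call produces a single integer vector, no intermediate long table or contingency table is built. If tokens could be longer than one character, the substring test would produce false matches and a split-based approach would be needed instead.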
