【Question title】: Create dummy variables from string with multiple values
【Posted】: 2019-03-02 19:58:03
【Question description】:

My dataset has a column that contains multiple values, separated by ;.

  name    sex     good_at
1 Tom      M   Drawing;Hiking
2 Mary     F   Cooking;Joking
3 Sam      M      Running
4 Charlie  M      Swimming

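I want to create a dummy variable for each unique value in good_at, so that each dummy variable holds TRUE or FALSE to indicate whether that individual has that particular value.

Desired output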

Drawing   Cooking
True       False
False      True
False      False
False      False

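【Comments】:

  • The problem I need to solve is that the existing variable holds several pieces of information at once, e.g. Drawing + Hiking. In Google Sheets I would use a function like REGEXMATCH, but I don't know how to code that in R. @CristianE.Nuno
  • Ah, I see now. Your question is different. Thanks for clarifying.

Tags: r reshape dummy-variable one-hot-encoding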


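【Solution 1】:

Overview

To create a dummy variable for each unique value in good_at, take the following steps:

  • Split good_at into multiple rows
  • Generate dummy variables for every good_at value of each name-sex pair, using dummy::dummy()
  • Reshape the data into 4 columns: name, sex, key and value
    • key holds all the dummy variable column names
    • value holds the values in each dummy variable
  • Keep only the records where value is not zero
  • Reshape the data into one record per name-sex pair, with as many columns as there are values in key
  • Cast the dummy columns to logical vectors.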
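
To make the idea behind these steps concrete, here is a minimal base R sketch that needs no extra packages. It is not the answer's method: it splits the strings with strsplit() and tests membership directly. The names skills, levels_found, dummies and result are made up for illustration.

```r
# Example data mirroring the question, built inline so the sketch is self-contained
df <- data.frame(
  name    = c("Tom", "Mary", "Sam", "Charlie"),
  sex     = c("M", "F", "M", "M"),
  good_at = c("Drawing;Hiking", "Cooking;Joking", "Running", "Swimming"),
  stringsAsFactors = FALSE
)

# Split every good_at string into its individual values (one vector per person)
skills <- strsplit(df$good_at, ";", fixed = TRUE)

# Collect the unique values; each one becomes a dummy column
levels_found <- sort(unique(unlist(skills)))

# For each unique value, check whether it appears in each person's vector
dummies <- sapply(levels_found, function(lv) {
  vapply(skills, function(s) lv %in% s, logical(1))
})
colnames(dummies) <- paste0("good_at_", levels_found)

# Attach the logical dummy columns to the original data
result <- cbind(df, dummies)
result
```

Because the membership test compares whole values after splitting, "Run" would not match "Running", which is usually what you want for one-hot encoding.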

Code

# load necessary packages ----
library(dummy)
library(tidyverse)

# load necessary data ----
df <-
  read.table(text = "name    sex     good_at
1 Tom      M   Drawing;Hiking
             2 Mary     F   Cooking;Joking
             3 Sam      M      Running
             4 Charlie  M      Swimming"
             , header = TRUE
             , stringsAsFactors = FALSE)

# create a longer version of df -----
# where one record represents
# one unique name, sex, good_at value
df_clean <-
  df %>%
  separate_rows(good_at, sep = ";")

# create dummy variables for all unique values in "good_at" column ----
df_dummies <-
  df_clean %>%
  select(good_at) %>%
  dummy() %>%
  bind_cols(df_clean) %>%
  # drop "good_at" column 
  select(-good_at) %>%
  # make the tibble long by reshaping it into 4 columns:
  # name, sex, key and value
  # where key are the all dummy variable column names
  # and value are the values in each dummy variable
  gather(key, value, -name, -sex) %>%
  # keep records where
  # value is not equal to zero
  # note: this is due to "Tom" having both a 
  # "good_at_Drawing" value of 0 and 1. 
  filter(value != 0) %>%
  # make the tibble wide
  # with one record per name-sex pair
  # and as many columns as there are in key
  # with their values from value
  # and filling NA values to 0
  spread(key, value, fill = 0) %>%
  # for each name-sex pair
  # cast the dummy variables into logical vectors
  group_by(name, sex) %>%
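  # funs() is deprecated since dplyr 0.8.0; across() (dplyr >= 1.0.0) is the current equivalent
  mutate(across(everything(), ~ as.logical(as.integer(.x)))) %>%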
  ungroup() %>%
  # just for safety let's join
  # the original "good_at" column
  left_join(y = df, by = c("name", "sex")) %>%
  # bring the original "good_at" column to the left-hand side 
  # of the tibble
  select(name, sex, good_at, matches("good_at_"))

# view result ----
df_dummies
# A tibble: 4 x 9
#   name  sex   good_at good_at_Cooking good_at_Drawing good_at_Hiking
#   <chr> <chr> <chr>   <lgl>           <lgl>           <lgl>         
# 1 Char… M     Swimmi… FALSE           FALSE           FALSE         
# 2 Mary  F     Cookin… TRUE            FALSE           FALSE         
# 3 Sam   M     Running FALSE           FALSE           FALSE         
# 4 Tom   M     Drawin… FALSE           TRUE            TRUE          
# ... with 3 more variables: good_at_Joking <lgl>, good_at_Running <lgl>,
#   good_at_Swimming <lgl>

# end of script #
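
For comparison, a shorter pipeline is possible with current tidyr. This is a sketch rather than the answer's code: it assumes a tidyr version that provides pivot_wider() with a scalar values_fill, and the helper columns skill and has_skill are hypothetical names introduced here.

```r
library(dplyr)
library(tidyr)

# Example data mirroring the question
df <- data.frame(
  name    = c("Tom", "Mary", "Sam", "Charlie"),
  sex     = c("M", "F", "M", "M"),
  good_at = c("Drawing;Hiking", "Cooking;Joking", "Running", "Swimming"),
  stringsAsFactors = FALSE
)

df_wide <- df %>%
  # copy the column so the original good_at string is kept
  mutate(skill = good_at) %>%
  # one row per name / skill combination
  separate_rows(skill, sep = ";") %>%
  # mark every existing combination as TRUE
  mutate(has_skill = TRUE) %>%
  # one column per skill; combinations that never occurred become FALSE
  pivot_wider(
    names_from   = skill,
    values_from  = has_skill,
    names_prefix = "good_at_",
    values_fill  = FALSE
  )

df_wide
```

This skips the dummy package and the filter step entirely, because only the combinations that exist are ever created and pivot_wider() fills in the rest.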

【Comments】:

    【Solution 2】:

    I created a function that gives the desired output:

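    dum <- function(kw, col, type = c(TRUE, FALSE)) {
      # row indices where the keyword matches, tagged with type[1]
      t <- as.data.frame(grep(as.character(kw), col, ignore.case = TRUE))
      t$one <- type[1]
      colnames(t) <- c("col1", "dummy")
      # row indices where the keyword does not match, tagged with type[2]
      t2 <- as.data.frame(grep(as.character(kw), col, ignore.case = TRUE,
        invert = TRUE))
      t2$zero <- type[2]
      colnames(t2) <- c("col1", "dummy")
      # stack both sets and restore the original row order
      t3 <- rbind(t, t2)
      t3 <- t3[order(t3$col1), ]
      return(t3$dummy)
    }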
    

    It may not be super elegant, but it works. Using your example, your data frame is df and the column you want to search is df$good_at:

    Drawing <- dum("drawing", df$good_at)
    > Drawing
      TRUE
      FALSE
      ...
    
    Cooking <- dum("cooking", df$good_at)
    > Cooking
      FALSE
      TRUE
      ...

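    【Comments】:

    • This function works for the first three columns, but not for the fourth and later ones. It shows: Error in `$<-.data.frame`(`*tmp*`, "one", value = TRUE): replacement has 1 row, data has 0 @mike
    • If you get that error, it means the keyword you are searching for does not appear in the column.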
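
The error discussed above can be avoided by using grepl(), which returns one logical value per element instead of a vector of matching indices. The sketch below is a hypothetical variant (dum_safe is not part of the answer) that keeps the same arguments and returns the second type value for every row when the keyword is absent.

```r
# Hypothetical variant of dum(): one result per element of `col`,
# even when the keyword matches nothing (or everything)
dum_safe <- function(kw, col, type = c(TRUE, FALSE)) {
  hit <- grepl(as.character(kw), col, ignore.case = TRUE)
  ifelse(hit, type[1], type[2])
}

good_at <- c("Drawing;Hiking", "Cooking;Joking", "Running", "Swimming")

dum_safe("drawing", good_at)  # TRUE FALSE FALSE FALSE
dum_safe("dancing", good_at)  # FALSE FALSE FALSE FALSE, no error
```

Like the original, this treats kw as a regular expression and matches substrings, so "run" would also match "Running".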