【Question Title】: Split column to multiple fields using R
【Posted】: 2017-08-15 02:30:12
【Question】:

My CSV has a column named "features". Each value in that column looks like this:

{""Air conditioning"",""Elevator"",""Smoke detector""}
{""Air conditioning"",""Railing Lights"",""Smoke detector""}
{""Air conditioning"",""Washer"",""Dryer"",""Smoke detector""}

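There are 20,000 records, and the strings inside the "features" field appear in no particular order.

How can I split them into separate columns so that every "Air conditioning" falls in the first column, every "Elevator" in the second, and so on?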

          a          b       c              d            
air conditioning elevators smokedetectors 
air conditioning elevators smokedetectors washer
air conditioning elevators smokedetectors washer

【Comments】:

  • See ?cSplit from the splitstackshape package.
  • You can just use read.csv(text = gsub('[{}]', '', txt), header = FALSE, quote = '""'), where txt is the text above as a single string.

Tags: r dplyr text-mining stringr text-analysis


【Solution 1】:

A combination of separate from tidyr and mutate_at from dplyr (with gsub):

dfr <- data.frame(features = c('{""Air conditioning"",""Elevator"",""Smoke detector""}',
                               '{""Air conditioning"",""Railing Lights"",""Smoke detector""}',
                               '{""Air conditioning"",""Washer"",""Dryer"",""Smoke detector""}'))

library(tidyr)
library(dplyr)

# Remove braces and double quotes
fix_txt <- function(x) gsub("[{]\"|\"|[}]", "", x)
separate(dfr, features, c("a", "b", "c"), sep = ",", extra = "merge") %>%
  mutate_at(vars(a:c), fix_txt)

This gives:

                 a              b                    c
1 Air conditioning       Elevator       Smoke detector
2 Air conditioning Railing Lights       Smoke detector
3 Air conditioning         Washer Dryer,Smoke detector

Note that the extra fields are merged into the last column (as in the third record); see ?separate for more options.
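If merging the overflow is not wanted, one alternative is to request as many columns as the longest record has and pad shorter records with NA. The following is only a sketch: the fourth column name d and the four-column width are assumptions based on the three sample rows, not something the answer specifies.

```r
library(tidyr)

dfr <- data.frame(
  features = c('{""Air conditioning"",""Elevator"",""Smoke detector""}',
               '{""Air conditioning"",""Railing Lights"",""Smoke detector""}',
               '{""Air conditioning"",""Washer"",""Dryer"",""Smoke detector""}'),
  stringsAsFactors = FALSE
)

# Strip braces and double quotes before splitting
dfr$features <- gsub('[{}"]', "", dfr$features)

# Four target columns (assumed width); records with fewer fields
# are padded with NA on the right instead of raising a warning
wide <- separate(dfr, features, c("a", "b", "c", "d"),
                 sep = ",", fill = "right")
```

This still assigns values by position, so Elevator and Washer both end up in column b.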

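【Comments】:

  • Thanks. As you will notice, column b of your output has Elevator in row 1 and Washer in row 3. What can be done so that all washers fall under one column and all elevators under another?
  • Your original question did not really indicate that! I think we have to rethink the solution entirely.
【Solution 2】: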

As mentioned in the comments, you can look at the "splitstackshape" package, specifically the cSplit_e function. With it, you can try:

library(data.table)       # provides as.data.table() and :=
library(splitstackshape)
cSplit_e(as.data.table(dfr)[, features := (gsub("[{}\"]", "", features))], 
         "features", ",", mode = "value", type = "character", drop = TRUE)
##    features_Air conditioning features_Dryer features_Elevator features_Railing Lights features_Smoke detector features_Washer
## 1:          Air conditioning             NA          Elevator                      NA          Smoke detector              NA
## 2:          Air conditioning             NA                NA          Railing Lights          Smoke detector              NA
## 3:          Air conditioning          Dryer                NA                      NA          Smoke detector          Washer

Here "dfr" is defined as in @Remko's answer:

dfr <- data.frame(features = c('{""Air conditioning"",""Elevator"",""Smoke detector""}',
                               '{""Air conditioning"",""Railing Lights"",""Smoke detector""}',
                               '{""Air conditioning"",""Washer"",""Dryer"",""Smoke detector""}'))

【Comments】:
