使用 R 将单个列拆分为多个观察值答案

【问题标题】：Splitting a single column into multiple observation using R使用 R 将单个列拆分为多个观察值
【发布时间】：2016-01-11 19:54:09
【问题描述】：

我正在处理 HCUP 数据，它在一列中有一系列值，需要拆分为多列。以下是 HCUP 数据框供参考：

code            label
61000-61003     excision of CNS
0169T-0169T     ventricular shunt

所需的输出应该是：

code            label
61000           excision of CNS
61001           excision of CNS
61002           excision of CNS
61003           excision of CNS
0169T           ventricular shunt

我解决这个问题的方法是使用包 splitstackshape 并使用此代码

library(data.table)
library(splitstackshape)

cSplit(hcup, "code", "-")[, list(code = code_1:code_2, by = label)]

这种方法会导致内存问题。有没有更好的方法来解决这个问题？

一些cmets：

数据中除了“T”之外还有很多字母。
字母可以在前面或最后，但不能在两个数字之间。
在一个范围内没有从“T”到“U”的字母变化

【问题讨论】：

嗯，我对 data.table 不是很有经验，但我看不出你的方法是如何工作的 - Code_1（不应该是 code_1 吗？）和 code_2必须是数字，如果你想建立一个序列，例如hcup <- read.table(header=T, stringsAsFactors = F, text="code label\n61000-61003 excision_of_CNS\n0169T-0169T ventricular_shunt"); cSplit(hcup, "code", "-")[, list(Code = seq(as.integer(gsub("\\D", "", code_1)), as.integer(gsub("\\D", "", code_2)))), by = label].
谢谢。我已接受编辑。我并不特别关注“splitstackshape”。有没有可能写一个函数来处理这个问题？
这可能对splitstackshape 文档有所帮助：如果您知道列中的所有值在拆分后每行具有相同数量的值，您应该使用cSplit_f 函数相反，它使用fread 而不是strsplit 并且通常更快。
所以也许你可以给我们更多的信息。字母T 总是字母？它总是在字符串的末尾吗？
对这个问题进行第二次猜测，我认为扩展数据框可能不是您最终想要做的事情。将代码列拆分为begin 和end，并存储code.prefix 和code.suffix 似乎会使匹配变得简单很多，这大概是针对的用例。

标签： r data.table medical data-cleaning splitstackshape

【解决方案1】：

这是使用来自Hmisc 的dplyr 和all.is.numeric 的解决方案：

library(dplyr)
library(Hmisc)
library(tidyr)
dat %>% separate(code, into=c("code1", "code2")) %>%
        rowwise %>%
        mutate(lists = ifelse(all.is.numeric(c(code1, code2)),
                         list(as.character(seq(from = as.numeric(code1), to = as.numeric(code2)))),
                         list(code1))) %>%
        unnest(lists) %>%
        select(code = lists, label)

Source: local data frame [5 x 2]

   code             label
  (chr)            (fctr)
1 61000   excision of CNS
2 61001   excision of CNS
3 61002   excision of CNS
4 61003   excision of CNS
5 0169T ventricular shunt

使用字符值修复范围的编辑。稍微降低了简单性：

dff %>% mutate(row = row_number()) %>%
        separate(code, into=c("code1", "code2")) %>%
        group_by(row) %>%
        summarise(lists = if(all.is.numeric(c(code1, code2)))
                              {list(str_pad(as.character(
                                   seq(from = as.numeric(code1), to = as.numeric(code2))),
                                         nchar(code1), pad="0"))}
                          else if(grepl("^[0-9]", code1))
                              {list(str_pad(paste0(as.character(
                                   seq(from = extract_numeric(code1), to = extract_numeric(code2))),
                                      strsplit(code1, "[0-9]+")[[1]][2]),
                                         nchar(code1), pad = "0"))}
                          else
                              {list(paste0(
                                      strsplit(code1, "[0-9]+")[[1]],
                                      str_pad(as.character(
                                    seq(from = extract_numeric(code1), to = extract_numeric(code2))),
                                         nchar(gsub("[^0-9]", "", code1)), pad="0")))},
                   label = first(label)) %>%
        unnest(lists) %>%
        select(-row)
Source: local data frame [15 x 2]

               label lists
               (chr) (chr)
1    excision of CNS 61000
2    excision of CNS 61001
3    excision of CNS 61002
4  ventricular shunt 0169T
5  ventricular shunt 0170T
6  ventricular shunt 0171T
7    excision of CNS 01000
8    excision of CNS 01001
9    excision of CNS 01002
10    some procedure A2543
11    some procedure A2544
12    some procedure A2545
13    some procedure A0543
14    some procedure A0544
15    some procedure A0545

数据：

dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002", 
"A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt", 
"excision of CNS", "some procedure", "some procedure")), .Names = c("code", 
"label"), row.names = c(NA, 5L), class = "data.frame")

【讨论】：

这看起来不错。但它在最终输出中省略了“0169T”之类的代码。
这个解决方案非常接近，但仍然错过了字母在前的那些代码。例如，代码“A4245”不会添加到最终数据库中。

【解决方案2】：

一种不太优雅的方式：

# the data
hcup <- data.frame(code=c("61000-61003", "0169T-0169T"),
                   label=c("excision of CNS", "ventricular shunt"), stringsAsFactors = F)
hcup
>         code             label
>1 61000-61003   excision of CNS
>2 0169T-0169T ventricular shunt

# reshaping
# split the code ranges into separate columns
seq.ends <- cbind(do.call(rbind.data.frame, strsplit(hcup$code, "-")), hcup$label)
# create a list with a data.frame for each original line
new.list <- apply(seq.ends, 1, FUN=function(x){data.frame(code=if(grepl("\\d{5}", x[1])){
                     z<-x[1]:x[2]}else{z<-x[1]}, label=rep(x[3], length(z)),
                     stringsAsFactors = F)})
# collapse the list into a df
new.df <- do.call(rbind, lapply(new.list, data.frame, stringsAsFactors=F))

new.df
>     code             label
>1.1 61000   excision of CNS
>1.2 61001   excision of CNS
>1.3 61002   excision of CNS
>1.4 61003   excision of CNS
>2   0169T ventricular shunt

【讨论】：

【解决方案3】：

原答案：更新见下文。

首先，我通过将第一行添加到底部，使您的示例数据更具挑战性。

dff <- structure(list(code = c("61000-61003", "0169T-0169T", "61000-61003"
), label = c("excision of CNS", "ventricular shunt", "excision of CNS"
)), .Names = c("code", "label"), row.names = c(NA, 3L), class = "data.frame")

dff
#          code             label
# 1 61000-61003   excision of CNS
# 2 0169T-0169T ventricular shunt
# 3 61000-61003   excision of CNS

我们可以使用序列运算符: 来获取code 列的序列，用tryCatch() 包裹，这样我们就可以避免错误，并保存无法排序的值。首先，我们用破折号- 拆分值，然后通过lapply() 运行它。

xx <- lapply(
    strsplit(dff$code, "-", fixed = TRUE), 
    function(x) tryCatch(x[1]:x[2], warning = function(w) x)
)
data.frame(code = unlist(xx), label = rep(dff$label, lengths(xx)))
#     code             label
# 1  61000   excision of CNS
# 2  61001   excision of CNS
# 3  61002   excision of CNS
# 4  61003   excision of CNS
# 5  0169T ventricular shunt
# 6  0169T ventricular shunt
# 7  61000   excision of CNS
# 8  61001   excision of CNS
# 9  61002   excision of CNS
# 10 61003   excision of CNS

我们正在尝试将序列运算符: 应用于strsplit() 中的每个元素，如果无法使用x[1]:x[2]，则仅返回这些元素的值并继续使用序列x[1]:x[2] 否则.然后我们只是根据xx 中的结果长度复制label 列的值，以获得新的label 列。

更新：以下是我针对您的编辑提出的意见。将上面的xx 替换为

xx <- lapply(strsplit(dff$code, "-", TRUE), function(x) {
    s <- stringi::stri_locate_first_regex(x, "[A-Z]")
    nc <- nchar(x)[1L]
    fmt <- function(n) paste0("%0", n, "d")
    if(!all(is.na(s))) {
        ss <- s[1,1]
        fmt <- fmt(nc-1)
        if(ss == 1L) {
            xx <- substr(x, 2, nc)
            paste0(substr(x, 1, 1), sprintf(fmt, xx[1]:xx[2]))
        } else {
            xx <- substr(x, 1, ss-1)
            paste0(sprintf(fmt, xx[1]:xx[2]), substr(x, nc, nc))
        }
    } else {
        sprintf(fmt(nc), x[1]:x[2])
    }
})

是的，这很复杂。现在如果我们把下面的数据框df2作为一个测试用例

df2 <- structure(list(code = c("61000-61003", "0169T-0174T", "61000-61003", 
"T0169-T0174"), label = c("excision of CNS", "ventricular shunt", 
"excision of CNS", "ventricular shunt")), .Names = c("code", 
"label"), row.names = c(NA, 4L), class = "data.frame")

并在上面运行xx代码，我们可以得到以下结果。

data.frame(code = unlist(xx), label = rep(df2$label, lengths(xx)))
#     code             label
# 1  61000   excision of CNS
# 2  61001   excision of CNS
# 3  61002   excision of CNS
# 4  61003   excision of CNS
# 5  0169T ventricular shunt
# 6  0170T ventricular shunt
# 7  0171T ventricular shunt
# 8  0172T ventricular shunt
# 9  0173T ventricular shunt
# 10 0174T ventricular shunt
# 11 61000   excision of CNS
# 12 61001   excision of CNS
# 13 61002   excision of CNS
# 14 61003   excision of CNS
# 15 T0169 ventricular shunt
# 16 T0170 ventricular shunt
# 17 T0171 ventricular shunt
# 18 T0172 ventricular shunt
# 19 T0173 ventricular shunt
# 20 T0174 ventricular shunt

【讨论】：

这很好用。但是输入数据有类似“'0005T-0006T”的代码。在这种情况下，最终输出中只有 0005T 被标记，但缺少代码 0006T。
我很抱歉，数据集很大，我错过了。是的，我希望在最终输出中同时包含这两个代码。
不确定您的示例是否可行。我猜每个标签在原始数据中只出现一次。

【解决方案4】：

为此类代码创建排序规则：

seq_code <- function(from,to){

    ext = function(x, part) gsub("([^0-9]?)([0-9]*)([^0-9]?)", paste0("\\",part), x)

    pre = unique(sapply(list(from,to), ext, part = 1 ))
    suf = unique(sapply(list(from,to), ext, part = 3 ))

    if (length(pre) > 1 | length(suf) > 1){
        return("NO!")
    }

    num = do.call(seq, lapply(list(from,to), function(x) as.integer(ext(x, part = 2))))
    len = nchar(from)-nchar(pre)-nchar(suf)

    paste0(pre, sprintf(paste0("%0",len,"d"), num), suf)

}

以@jeremycg 为例：

setDT(dff)[,.(
  label = label[1], 
  code  = do.call(seq_code, tstrsplit(code,'-'))
), by=.(row=seq(nrow(dff)))]

给了

    row             label  code
 1:   1   excision of CNS 61000
 2:   1   excision of CNS 61001
 3:   1   excision of CNS 61002
 4:   2 ventricular shunt 0169T
 5:   2 ventricular shunt 0170T
 6:   2 ventricular shunt 0171T
 7:   3   excision of CNS 01000
 8:   3   excision of CNS 01001
 9:   3   excision of CNS 01002
10:   4    some procedure A2543
11:   4    some procedure A2544
12:   4    some procedure A2545
13:   5    some procedure A0543
14:   5    some procedure A0544
15:   5    some procedure A0545

从@jeremycg 的回答中复制的数据：

dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002", 
"A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt", 
"excision of CNS", "some procedure", "some procedure")), .Names = c("code", 
"label"), row.names = c(NA, 5L), class = "data.frame")

【讨论】：

【解决方案5】：

如果您有足够的耐心，您可能会将字符串解析为单独的片段，而不是使用 eval/parse 技巧，可惜我不是，所以：

fancy.seq = function(x) eval(parse(text=sub(', \\)', ')', sub('\\(, ', '(',
               sub('.*?([0-9]+)(.*)-(.*?)([1-9][0-9]*).*',
                   'paste0("\\3",
                           formatC(\\1:\\4, width=log10(\\4)+1, format="d", flag="0"),
                           "\\2")',
                   x)))))
# using example from jeremycg's answer
dt[, .(fancy.seq(code), label), by = 1:nrow(dt)]
#    nrow    V1             label
# 1:    1 61000   excision of CNS
# 2:    1 61001   excision of CNS
# 3:    1 61002   excision of CNS
# 4:    2 0169T ventricular shunt
# 5:    2 0170T ventricular shunt
# 6:    2 0171T ventricular shunt
# 7:    3 01000   excision of CNS
# 8:    3 01001   excision of CNS
# 9:    3 01002   excision of CNS
#10:    4 A2543    some procedure
#11:    4 A2544    some procedure
#12:    4 A2545    some procedure
#13:    5 A0543    some procedure
#14:    5 A0544    some procedure
#15:    5 A0545    some procedure

如果不清楚以上内容在做什么 - 只需对“代码”字符串之一逐一运行 sub 命令。

【讨论】：

\\2:\\4 太棒了！