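How to separate one column into multiple based on a character vector of delimiters: answers

[Question title]: How to separate one column into multiple based on a character vector of delimiters
[Posted]: 2017-12-18 15:33:09
[Question description]:

I have a data frame consisting of a single column. I want to split the text into separate columns based on a vector of delimiters.

Input: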

Mypath<-"Hospital Number 233456 Patient Name: Jonny Begood  DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely"
Mypath<-data.frame(Mypath)
names(Mypath)<- "PathReportWhole"

Expected output:

structure(list(PathReportWhole = structure(1L, .Label = "Hospital Number 233456 Patient Name: Jonny Begood\n    DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely", class = "factor"), 
    HospitalNumber = " 233456 ", PatientName = " Jonny Begood", 
    DOB = " 13/01/77 ", GeneralPractitioner = NA_character_, 
    Dateofprocedure = NA_character_, ClinicalDetails = " Dyaphagia and reflux ", 
    Macroscopicdescription = " 3 pieces of oesophagus, all good biopsies\n ", 
    Histology = " These show chronic reflux and other bits n bobs\n ", 
    Diagnosis = " Acid reflux likely"), row.names = c(NA, -1L
), .Names = c("PathReportWhole", "HospitalNumber", "PatientName", 
"DOB", "GeneralPractitioner", "Dateofprocedure", "ClinicalDetails", 
"Macroscopicdescription", "Histology", "Diagnosis"), class = "data.frame")

I would love to use the separate function from tidyr, but I can't quite figure out whether it can split on a list of delimiters.

The list would be:

mywords<-c("Hospital Number","Patient Name","DOB:","General Practitioner:","Date of Procedure:","Clinical Details:","Macroscopic description:","Histology:","Diagnosis:")

I then tried:

Mypath %>% separate(Mypath, mywords)

but I have clearly misunderstood the function, which I think cannot take a list of delimiters:

Error: `var` must evaluate to a single number or a column name, not a list

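Is there a simple way to do this with tidyr (or cSplit with a list, or any other way)?

[Comments on the question]:

Here's a start: strsplit(as.character(Mypath$PathReportWhole), paste(mywords, collapse = "|"))
Is the data always in the same format (order), and does it always contain all of the sections?
Since you mentioned cSplit, here's an idea: setNames(cSplit(transform(Mypath, PathReportWhole = gsub(paste(mywords, collapse = '|'), '-', PathReportWhole)), 'PathReportWhole', '-', 'wide'), c('PathReportWhole' ,mywords))... It's not 100% there, but you can refine it a little.
@docendodiscimus The data is not always in the same order and does not always contain all of the sections, but I would hope it doesn't fail on any row in that case; perhaps it would just insert NA into that column. I'll test it later.
Hmm, that probably makes it a bit more complicated. In the meantime you could try the tidyr way: Mypath %>% separate(PathReportWhole, into = mywords, sep = paste(mywords, collapse = "|")).

Tags: r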

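[Solution 1]:

Perhaps make sure the text looks like a DCF file; then you can use read.dcf:

Note that "mywords" differs slightly from yours. I added colons to "Hospital Number" and "Patient Name".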

mywords<-c("Hospital Number:","Patient Name:","DOB:","General Practitioner:",
           "Date of Procedure:","Clinical Details:","Macroscopic description:",
           "Histology:","Diagnosis:")

Convert the relevant column to character, and add a colon after "Hospital Number" in the text.

Mypath$PathReportWhole <- as.character(Mypath$PathReportWhole)
Mypath$PathReportWhole <- gsub("Hospital Number", "Hospital Number:", Mypath$PathReportWhole)

Put each key: value pair on its own line.

temp <- gsub(sprintf("(%s)", paste(mywords, collapse = "|")), "\n\\1", Mypath$PathReportWhole)
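To see what this substitution does, here is a minimal sketch on a toy string (`toy`, `toy_keys`, `pattern` and `toy_lines` are made-up names for illustration, not part of the original data). Each key is captured by the group and written back with a newline in front of it, which gives the one-field-per-line layout that read.dcf expects.

```r
# Hypothetical toy input; not part of the original data
toy <- "a: 1 b: 2 c: 3"
toy_keys <- c("a:", "b:", "c:")

# Build the alternation "(a:|b:|c:)" and prefix every match with a newline
pattern <- sprintf("(%s)", paste(toy_keys, collapse = "|"))
toy_lines <- gsub(pattern, "\n\\1", toy)

# cat() renders the embedded newlines, so each key: value pair
# is shown on its own line
cat(toy_lines, "\n")
```

Note that this assumes the keys contain no regular-expression metacharacters; keys with characters such as "(" or "." would need escaping first.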

Read it in with read.dcf:

out <- read.dcf(textConnection(temp))
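read.dcf returns a character matrix, not a data frame. As a rough sketch of how the result might be attached back to the original column, the block below repeats the steps on a shortened version of the report (the shortened text, the reduced `mywords`, and the names `con`, `parsed` and `result` are assumptions made for illustration). It also closes the text connection explicitly, since read.dcf does not close a connection that was already open.

```r
# Shortened version of the question's report, for illustration only
Mypath <- data.frame(
  PathReportWhole = "Hospital Number 233456 Patient Name: Jonny Begood  DOB: 13/01/77 Diagnosis: Acid reflux likely",
  stringsAsFactors = FALSE
)
mywords <- c("Hospital Number:", "Patient Name:", "DOB:", "Diagnosis:")

# Same preparation steps as above: add the missing colon,
# then put each key: value pair on its own line
temp <- gsub("Hospital Number", "Hospital Number:", Mypath$PathReportWhole)
temp <- gsub(sprintf("(%s)", paste(mywords, collapse = "|")), "\n\\1", temp)

# Keep a handle on the connection so it can be closed afterwards
con <- textConnection(temp)
out <- read.dcf(con)
close(con)

# Convert the character matrix to a data frame and drop the spaces
# from the field names, e.g. "Hospital Number" -> "HospitalNumber"
parsed <- as.data.frame(out, stringsAsFactors = FALSE)
names(parsed) <- gsub(" ", "", colnames(out), fixed = TRUE)

# Bind the parsed fields to the original column
result <- cbind(Mypath, parsed)
str(result)
```

Unlike the expected output in the question, the values come back without leading or trailing whitespace, because read.dcf ignores it by default.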

Here is some sample data that makes the resulting structure easier to see:

example <- c("var 1 abc var 2: some, text var 3: 112 var 4: value var 5: even more here",
            "var 1 xyz var 2: more text here var 5: not all values are there")
example <- data.frame(report = example)
example
#                                                                      report
# 1 var 1 abc var 2: some, text var 3: 112 var 4: value var 5: even more here
# 2           var 1 xyz var 2: more text here var 5: not all values are there

And, following the same steps:

mywords <- c("var 1:", "var 2:", "var 3:", "var 4:", "var 5:")
temp <- as.character(example$report)
temp <- gsub("var 1", "var 1:", temp)
temp <- gsub(sprintf("(%s)", paste(mywords, collapse = "|")), "\n\\1", temp)
read.dcf(textConnection(temp))
#      var 1 var 2            var 3 var 4   var 5                     
# [1,] "abc" "some, text"     "112" "value" "even more here"          
# [2,] "xyz" "more text here" NA    NA      "not all values are there"

read.dcf(textConnection(temp), fields = c("var 1", "var 3", "var 5"))
#      var 1 var 3 var 5                     
# [1,] "abc" "112" "even more here"          
# [2,] "xyz" NA    "not all values are there"
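Since the reports may list their sections in a different order or leave some out, the steps could be wrapped in a small helper along these lines. This is only a sketch: `split_report` is a hypothetical name, and it assumes every key in the text is already followed by a colon, that every report contains at least one key, and that the keys contain no regular-expression metacharacters. Passing `fields` to read.dcf fixes the set and order of the output columns, so a section that is absent from a report comes back as NA.

```r
# Hypothetical helper: split each string in `text` into one column per key.
# `keys` are given without the trailing colon.
split_report <- function(text, keys) {
  pattern <- sprintf("(%s):", paste(keys, collapse = "|"))
  # Put each "key: value" pair on its own line. The leading newline of each
  # report also leaves a blank line between records, which read.dcf()
  # treats as the record separator.
  dcf_text <- gsub(pattern, "\n\\1:", as.character(text))
  con <- textConnection(dcf_text)
  on.exit(close(con))
  # `fields` fixes which columns are returned and in what order
  out <- read.dcf(con, fields = keys)
  as.data.frame(out, stringsAsFactors = FALSE)
}

# The sections appear in a different order in the first report,
# and "var 2" is missing from the second one
reports <- c("var 2: some, text var 1: abc var 5: even more here",
             "var 1: xyz var 5: not all values are there")
res <- split_report(reports, c("var 1", "var 2", "var 5"))
res
```

The result has one row per report and one column per key, in the order the keys were supplied.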

[Comments]: