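[Posted]: 2017-12-18 15:33:09
[Question]:
I have a data frame consisting of a single column. I want to split the text into separate columns based on a vector of delimiters.
Input: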
Mypath<-"Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely"
Mypath<-data.frame(Mypath)
names(Mypath)<- "PathReportWhole"
Expected output:
structure(list(PathReportWhole = structure(1L, .Label = "Hospital Number 233456 Patient Name: Jonny Begood\n DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely", class = "factor"),
HospitalNumber = " 233456 ", PatientName = " Jonny Begood",
DOB = " 13/01/77 ", GeneralPractitioner = NA_character_,
Dateofprocedure = NA_character_, ClinicalDetails = " Dyaphagia and reflux ",
Macroscopicdescription = " 3 pieces of oesophagus, all good biopsies\n ",
Histology = " These show chronic reflux and other bits n bobs\n ",
Diagnosis = " Acid reflux likely"), row.names = c(NA, -1L
), .Names = c("PathReportWhole", "HospitalNumber", "PatientName",
"DOB", "GeneralPractitioner", "Dateofprocedure", "ClinicalDetails",
"Macroscopicdescription", "Histology", "Diagnosis"), class = "data.frame")
I would like to use the separate function from tidyr, but I can't quite work out whether it can split on a list of delimiters.
The list would be:
mywords<-c("Hospital Number","Patient Name","DOB:","General Practitioner:","Date of Procedure:","Clinical Details:","Macroscopic description:","Histology:","Diagnosis:")
Then I tried:
Mypath %>% separate(Mypath, mywords)
But I have clearly misunderstood the function, which does not seem to accept a list of delimiters:
Error: `var` must evaluate to a single number or a column name, not a list
Is there a simple way to do this with tidyr (or cSplit with a list, or any other way)?
[Comments]:
-
Here is a start:
strsplit(as.character(Mypath$PathReportWhole), paste(mywords, collapse = "|")) -
Is the data always in the same format (order), and does it always contain all of the sections?
-
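Since you mentioned
cSplit, here is an idea: setNames(cSplit(transform(Mypath, PathReportWhole = gsub(paste(mywords, collapse = '|'), '-', PathReportWhole)), 'PathReportWhole', '-', 'wide'), c('PathReportWhole' ,mywords))... It is not 100% there, but you can refine it a bit -
@docendodiscimus The data is not always in the same order and does not always contain every section, but I would like it not to fail on any row in that case; perhaps it could just insert NA into the column. I will test this later.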
-
Hmm, that probably makes it a bit more complicated. In the meantime, you could try the tidyr way:
Mypath %>% separate(PathReportWhole, into = mywords, sep = paste(mywords, collapse = "|")).
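-
For the case raised above, where sections can appear in any order or be missing, a minimal base R sketch might look like the following. It is not from the original thread: `extract_sections` is a hypothetical helper name, the sample `reports` vector is made up for illustration, and it assumes each keyword occurs at most once per report and is matched literally (not as a regular expression).

```r
# Delimiters from the question
mywords <- c("Hospital Number", "Patient Name", "DOB:", "General Practitioner:",
             "Date of Procedure:", "Clinical Details:", "Macroscopic description:",
             "Histology:", "Diagnosis:")

# Hypothetical helper: returns one named value per keyword, NA when the
# keyword is absent from the text. Order of the sections does not matter.
extract_sections <- function(text, keys) {
  out <- setNames(rep(NA_character_, length(keys)), keys)
  # Position of the first literal occurrence of each key (-1 when not found)
  starts <- vapply(keys, function(k) as.integer(regexpr(k, text, fixed = TRUE)),
                   integer(1))
  found <- which(starts > 0)
  found <- found[order(starts[found])]  # keys in the order they appear in the text
  for (i in seq_along(found)) {
    k <- found[i]
    from <- starts[k] + nchar(keys[k])
    # A section runs until the next key that was found, or to the end of the text
    to <- if (i < length(found)) starts[found[i + 1]] - 1L else nchar(text)
    value <- substr(text, from, to)
    # Drop a leading colon (some keys, e.g. "Patient Name", are listed without one)
    out[k] <- trimws(sub("^:", "", value))
  }
  out
}

# Made-up sample reports: the second is out of order and lacks most sections
reports <- c(
  "Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77 Diagnosis: Acid reflux likely",
  "Diagnosis: Acid reflux likely DOB: 13/01/77"
)

# One row per report, one column per keyword
sections <- do.call(rbind, lapply(reports, extract_sections, keys = mywords))
```

With the data frame from the question, the same call would take `as.character(Mypath$PathReportWhole)` in place of `reports`. Unlike the expected output in the question, this sketch trims surrounding whitespace from each value.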
Tags: r