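The input data poses several challenges:
- The data is given as a plain character vector rather than a data.frame with predefined columns.
- Some rows consist of key/value pairs separated by ": ".
- Other rows act as section headers. All key/value pairs in the rows below a header belong to that section until the next header is reached.
The code below relies on only two assumptions:
- Key/value pairs contain one and only one ": ".
- Section headers contain none at all.
Multiple keys within a section, e.g., several rows with email addresses, are handled by specifying toString() as the aggregation function for dcast().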
library(data.table)
# the zoo package must be installed for zoo::na.locf()
# coerce to data.table
data.table(text = txt)[
  # split key/value pairs into columns
  , tstrsplit(text, ": ")][
    # pick section headers and create new column
    is.na(V2), Name := V1][
      # fill in Name into the rows below
      , Name := zoo::na.locf(Name)][
        # reshape key/value pairs from long to wide format using Name as row id
        !is.na(V2), dcast(.SD, Name ~ V1, fun.aggregate = toString, value.var = "V2")]
    Name City/Town          Email
1: Name1 Location1 email1@xyz.com
2: Name2 Location2 email2@abc.com
3: Name3 Location3 email3@pqr.com
4: Name4 Location4             NA
5: Name5        NA email5@abc.com
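The two assumptions stated above can be checked before parsing. The following is a minimal sketch in base R and not part of the original code; the helper name count_sep and the small sample vector are made up for illustration.

```r
# Hypothetical helper (not from the answer): count how often the separator
# ": " occurs in each element of a character vector.
count_sep <- function(x, sep = ": ") {
  lengths(regmatches(x, gregexpr(sep, x, fixed = TRUE)))
}

# Small sample in the same layout as the input data
sample_txt <- c("Name1", "Email: email1@xyz.com", "City/Town: Location1",
                "Name4", "City/Town: Location4")

n_sep <- count_sep(sample_txt)

# 0 separators = section header, 1 separator = key/value pair;
# any other count violates the assumptions the parsing code relies on.
stopifnot(all(n_sep %in% 0:1))

# The first line should be a header, otherwise the key/value pairs
# that precede the first header belong to no section.
stopifnot(n_sep[1] == 0)
```

A value such as "Note: see: above" would be counted twice and stop the check, which is the case the first assumption rules out.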
Data
txt <- c("Name1", "Email: email1@xyz.com", "City/Town: Location1", "Name2",
"Email: email2@abc.com", "City/Town: Location2", "Name3", "Email: email3@pqr.com",
"City/Town: Location3", "Name4", "City/Town: Location4", "Name5",
"Email: email5@abc.com")
Or, try with more "realistic" names:
txt1 <- c("John Doe", "Email: email1@xyz.com", "City/Town: Location1", "Save the World Fund",
"Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com",
"City/Town: Location3", "Mother", "City/Town: Location4", "Jane",
"Email: email5@abc.com")
This results in:
                  Name City/Town          Email
1:     Best Shoes Ltd. Location3 email3@pqr.com
2:                Jane        NA email5@abc.com
3:            John Doe Location1 email1@xyz.com
4:              Mother Location4             NA
5: Save the World Fund Location2 email2@abc.com
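Note that the rows of this result are sorted by Name rather than kept in input order, because dcast() orders its result by the left-hand side of the formula. If the original order matters, one option is to number the sections and cast on that number as well. The following is a minimal sketch and not part of the original answer: the column name id is an assumption introduced for illustration, the sample is shortened, and the fill-down is done by grouping on cumsum() instead of zoo::na.locf().

```r
library(data.table)

# Shortened sample with names that are not in alphabetical order
sample_txt <- c("John Doe", "Email: email1@xyz.com", "City/Town: Location1",
                "Best Shoes Ltd.", "Email: email3@pqr.com",
                "Jane", "Email: email5@abc.com")

dt <- data.table(text = sample_txt)[, tstrsplit(text, ": ", fixed = TRUE)]

# id counts the headers seen so far, so every row carries the running
# number of its section (hypothetical column, not in the original code)
dt[, id := cumsum(is.na(V2))]

# the first row of each section is its header
dt[, Name := V1[1L], by = id]

# casting on id + Name keeps the sections in input order
res <- dt[!is.na(V2),
          dcast(.SD, id + Name ~ V1, fun.aggregate = toString, value.var = "V2")]
res$Name
```

A side effect of this design is that two sections with the same header text stay separate rows, since they receive different id values.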
Or, with multiple keys per section:
txt2 <- c("John Doe", "Email: email1@xyz.com", "Email: email1@abc.com", "City/Town: Location1", "Save the World Fund",
"Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com",
"City/Town: Location3", "Mother", "City/Town: Location4", "City/Town: everywhere", "Jane",
"Email: email5@abc.com")
                  Name             City/Town                          Email
1:     Best Shoes Ltd.             Location3                 email3@pqr.com
2:                Jane                                       email5@abc.com
3:            John Doe             Location1 email1@xyz.com, email1@abc.com
4:              Mother Location4, everywhere
5: Save the World Fund             Location2                 email2@abc.com
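The comma-separated cells in this last result come from toString(), and the blank cells are consistent with how toString() treats an empty input. A short base R illustration, added as a sketch rather than taken from the original answer; the variable names are made up:

```r
# several values for the same key are collapsed into one comma-separated string
collapsed <- toString(c("email1@xyz.com", "email1@abc.com"))
collapsed
# → "email1@xyz.com, email1@abc.com"

# no value at all collapses to an empty string
empty <- toString(character(0))
empty
# → ""
```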