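[Posted]: 2021-09-26 18:04:36
[Question]:
This feels like a fairly hard data manipulation / data frame repair problem in R. We have the following messy data frame, which currently has several columns' worth of information packed into the X2 column. The example below uses fake names, emails, and phone numbers: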
coach_info <- structure(list(X1 = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), X2 = c("TBA\r\n Head Coach", "Bobby Flowes\r\n Associate Head Women's Basketball Coach",
"Jimmy Jimm\r\n Assistant Women's Basketball Coach", "Rod Barber\r\n Head Men's Basketball Coach\r\n (123) 456-7890Tom.Tommy@abc.edu",
NA, "Gabens Spar\r\n Men's Basketball Graduate Assistant Coachgabensspar@gmail.edu",
"A.B. Better\r\n Head Women's Basketball Coach/Head Men's Golf Coach/Sports Information Associateabbetter@gmail.edu\r\n 111-222-3333",
"Nick Romanov\r\n Head Crew Coach\r\n nick.nick@school.edu\r\n 123-123-1234",
"Name Lasttt\r\n Assistant Coach")), row.names = c(1L, 2L, 3L,
7L, 12L, 16L, 17L, 25L, 29L), class = "data.frame")
head(coach_info, 4)
X1 X2
1 <NA> TBA\r\n Head Coach
2 <NA> Bobby Flowes\r\n Associate Head Women's Basketball Coach
3 <NA> Jimmy Jimm\r\n Assistant Women's Basketball Coach
7 <NA> Rod Barber\r\n Head Men's Basketball Coach\r\n (123) 456-7890Tom.Tommy@abc.edu
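We are trying to split the X2 column into four columns: Name, Title, Email, and Phone. When we run strsplit(coach_info$X2, '\r\n'), we get a messy nested list, and splitting on \r\n is imperfect because in some rows the \r\n between fields is missing.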
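On top of that, the inner list elements have different lengths, because many rows are missing one or more of the name, phone number, or email address: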
> unlist(lapply(strsplit(coach_info$X2, '\r\n'), length))
[1] 2 2 2 3 1 2 3 4 2
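To make the two problems concrete, here is a small sketch on three of the strings above (the vector x is just an illustrative subset, not the full column). It assumes nothing beyond base R, and uses lengths() as a shorter equivalent of unlist(lapply(..., length)):

```r
# Illustrative subset of coach_info$X2 (copied from the example data)
x <- c("Rod Barber\r\n Head Men's Basketball Coach\r\n (123) 456-7890Tom.Tommy@abc.edu",
       NA,
       "Name Lasttt\r\n Assistant Coach")

# Split on the literal line break and trim the leftover leading spaces
pieces <- lapply(strsplit(x, "\r\n", fixed = TRUE), trimws)

# Number of pieces per row; an NA row still counts as one piece
n_pieces <- lengths(pieces)

# The last piece of the first row still holds phone and email fused together
fused <- pieces[[1]][3]
```

So the split alone gives ragged results, and the phone and email still have to be separated by some other means.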
Our goal is to get as close as possible to this:
output_df <- data.frame(
Name = c('TBA', 'Bobby Flowes', 'Jimmy Jimm', 'Rod Barber', NA, 'Gabens Spar', 'A.B. Better', 'Nick Romanov', 'Name Lasttt'),
Title = c('Head Coach', "Associate Head Women's Basketball Coach", "Assistant Women's Basketball Coach", "Head Men's Basketball Coach",
NA, " Men's Basketball Graduate Assistant", "Head Women's Basketball Coach/Head Men's Golf Coach/Sports Information Associate",
"Head Crew Coach", "Assistant Coach"),
Email = c(NA, NA, NA, "Tom.Tommy@abc.edu", NA, "Coachgabensspar@gmail.edu", "abbetter@gmail.edu", "nick.nick@school.edu", NA),
Phone = c(NA, NA, NA, "(123) 456-7890", NA, NA, "111-222-3333", "123-123-1234", NA),
stringsAsFactors = FALSE
)
> head(output_df, 4)
Name Title Email Phone
1 TBA Head Coach <NA> <NA>
2 Bobby Flowes Associate Head Women's Basketball Coach <NA> <NA>
3 Jimmy Jimm Assistant Women's Basketball Coach <NA> <NA>
4 Rod Barber Head Men's Basketball Coach Tom.Tommy@abc.edu (123) 456-7890
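Where there is no whitespace or \r\n between two fields, as in several of the rows above, it seems impossible to split the string cleanly. At this point we are just trying to get as close as possible...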
[Discussion]:
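One way to get close, offered as a sketch rather than a definitive answer: instead of relying on \r\n alone, pull out the phone number and then the email with regular expressions, delete them from the string, and split only what remains on \r\n. The names phone_re, email_re, first_match and parse_coach below are made up for illustration, the two patterns are assumptions that only cover the formats visible in the sample data, and x2 is a small stand-in for coach_info$X2. The phone is removed first because otherwise the email pattern would swallow digits fused to it, as in 456-7890Tom.Tommy@abc.edu.

```r
# Stand-in for coach_info$X2 (rows copied from the example data)
x2 <- c("TBA\r\n Head Coach",
        "Rod Barber\r\n Head Men's Basketball Coach\r\n (123) 456-7890Tom.Tommy@abc.edu",
        NA,
        "Nick Romanov\r\n Head Crew Coach\r\n nick.nick@school.edu\r\n 123-123-1234")

# Assumed patterns: they cover only the phone/email shapes seen in the sample
phone_re <- "\\(?[0-9]{3}\\)?[ -]?[0-9]{3}-[0-9]{4}"
email_re <- "[[:alnum:]._-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"

# First match of `pattern` in a single string, or NA when there is none
first_match <- function(pattern, text) {
  m <- regmatches(text, regexpr(pattern, text))
  if (length(m) == 0) NA_character_ else m
}

# Hypothetical helper: parse one X2 value into the four target fields
parse_coach <- function(x) {
  if (is.na(x)) {
    return(c(Name = NA_character_, Title = NA_character_,
             Email = NA_character_, Phone = NA_character_))
  }
  phone <- first_match(phone_re, x)
  x <- sub(phone_re, "", x)   # drop the phone before searching for the email
  email <- first_match(email_re, x)
  x <- sub(email_re, "", x)   # drop the email; name and title remain
  parts <- trimws(strsplit(x, "\r\n", fixed = TRUE)[[1]])
  parts <- parts[nzchar(parts)]   # discard pieces left empty by the removals
  # Missing pieces index to NA, so rows without a title still work
  c(Name = parts[1], Title = parts[2], Email = email, Phone = phone)
}

# One row per input string, columns Name / Title / Email / Phone
parsed <- as.data.frame(do.call(rbind, lapply(x2, parse_coach)),
                        stringsAsFactors = FALSE)
```

To run it on the real data, replace x2 with coach_info$X2. Note the limit of this approach: it cannot separate a title that is fused directly to an email. For the row ending in Associateabbetter@gmail.edu, the email pattern would take the whole run Associateabbetter@gmail.edu, because nothing in the text marks where the title ends. Splitting those cases would need extra knowledge, such as a list of words that titles are known to end with.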
Tags: r dataframe data-manipulation strsplit