R - Splitting character vector so that every unique element is added to a new character vector

【Question title】: R - Splitting character vector so that every unique element is added to a new character vector
【Posted】: 2016-04-27 22:18:04
【Question】:

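I have a character vector in which each element contains several strings separated by commas. I obtained it by extracting it from a data frame, and it looks like this: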

 [1] "Acworth, Crescent Lake, East Acworth, Lynn, South Acworth"                                                                              
 [2] "Ferncroft, Passaconaway, Paugus Mill"                                                                                                   
 [3] "Alexandria, South Alexandria"                                                                                                           
 [4] "Allenstown, Blodgett, Kenison Corner, Suncook (part)"                                                                                   
 [5] "Alstead, Alstead Center, East Alstead, Forristalls Corner, Mill Hollow"                                                                 
 [6] "Alton, Alton Bay, Brookhurst, East Alton, Loon Cove, Mount Major, South Alton, Spring Haven, Stockbridge Corners, West Alton, Woodlands"
 [7] "Amherst, Baboosic Lake, Cricket Corner, Ponemah"                                                                                        
 [8] "Andover, Cilleyville, East Andover, Halcyon Station, Potter Place, West Andover"                                                        
 [9] "Antrim, Antrim Center, Clinton Village, Loverens Mill, North Branch"                                                                    
[10] "Ashland"

I want to obtain a new character vector in which each of those strings is its own element, i.e.:

 [1] "Acworth", "Crescent Lake", "East Acworth", "Lynn", "South Acworth"                                                                              
 [6] "Ferncroft", "Passaconaway", "Paugus Mill", "Alexandria", "South Alexandria"

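I used the strsplit() function, but it returns a list. When I try to convert that list to a character vector, it reverts to the old state.

I'm sure this is a very simple problem - any help would be greatly appreciated! Thanks!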

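【Comments】:

Run unlist after strsplit instead of as.character.
So simple! Awesome, thank you very much!
See the demo. By the way, there are spaces; do you want to get rid of them?
@WiktorStribiżew ", " instead of ","
@DavidArenburg: I updated the demo in the link above.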
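
A minimal sketch of the difference the first comment points at. The input `x` is a shortened, made-up stand-in for the question's data, and the variable names are only for illustration:

```r
# Hypothetical two-element input, shortened from the question's data
x <- c("Acworth, Crescent Lake", "Ashland")

# strsplit() returns a list with one character vector per input element
parts <- strsplit(x, ", ", fixed = TRUE)

# as.character() converts each list element to a single string,
# so the result still has one element per original string
as_chr <- as.character(parts)
length(as_chr)

# unlist() flattens the list into one character vector,
# with every town as its own element
flat <- unlist(parts)
length(flat)
```

This is likely why the conversion appeared to "revert to the old state": `as.character()` keeps one element per list entry, whereas `unlist()` concatenates the entries.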

Tags: regex r vector strsplit

【Solution 1】:

You can get rid of the spaces by splitting the character vector with the "\\s*,\\s*" regex and then calling unlist on the result:

v <- c("Acworth, Crescent Lake, East Acworth, Lynn, South Acworth", "Ferncroft, Passaconaway, Paugus Mill", "Alexandria, South Alexandria",  "Allenstown, Blodgett, Kenison Corner, Suncook (part)", "Alstead, Alstead Center, East Alstead, Forristalls Corner, Mill Hollow", "Alton, Alton Bay, Brookhurst, East Alton, Loon Cove, Mount Major, South Alton, Spring Haven, Stockbridge Corners, West Alton, Woodlands", "Amherst, Baboosic Lake, Cricket Corner, Ponemah",  "Andover, Cilleyville, East Andover, Halcyon Station, Potter Place, West Andover",  "Antrim, Antrim Center, Clinton Village, Loverens Mill, North Branch",  "Ashland" )
s <- unlist(strsplit(v, "\\s*,\\s*"))

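See the IDEONE demo.

The regex matches a comma with zero or more whitespace characters (\s*) on either side, which trims the values as they are split. It also handles the case where a stray space appears before a comma in the initial character vector.

【Comments】: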
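
To make the trimming effect concrete, here is a small sketch comparing a plain "," split with the "\\s*,\\s*" split. The vector `messy` is a made-up input with irregular spacing, not data from the question:

```r
# Hypothetical input with irregular spacing around the commas
messy <- c("Acworth ,Crescent Lake,  East Acworth", "Ashland")

# Splitting on the comma alone keeps the surrounding spaces
plain <- unlist(strsplit(messy, ","))

# Splitting on the comma plus any adjacent whitespace drops them
trimmed <- unlist(strsplit(messy, "\\s*,\\s*"))

print(plain)
print(trimmed)
```

Note that this pattern only removes whitespace next to a comma. Whitespace at the very start or end of an input string is not adjacent to a comma and would remain; wrapping the result in `trimws()` is one way to cover that case.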

【Solution 2】:

The title of your post suggests that you want unique strings, so

unique(unlist(strsplit(myvec, split=",")))

or

unique(unlist(strsplit(myvec, split=", ")))

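if there is always a space after the comma.

【Comments】:

", " instead of ",", and you could add fixed = TRUE
Actually, I don't see a requirement to keep only the unique strings, but yes, that can be done.
@DavidArenburg, I guess I don't usually assume that automatically. I would call sub() afterwards to tidy up the leading whitespace.
@WiktorStribiżew, I took it from "unique element" in the post title, so if "Crescent Lake" appears twice in the list, it is wanted only once in the output. If that is not the desired effect, remove unique().
@DavidArenburg, fixed = TRUE? Is that a stackoverflow "thing", as opposed to the OP marking the question as answered?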
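
A short sketch of why the choice of separator matters once unique() is involved. The vector `myvec` below is a made-up input in which "Crescent Lake" deliberately appears twice; the trimws() variant is an alternative not mentioned in the answer:

```r
# Hypothetical input; "Crescent Lake" appears twice on purpose
myvec <- c("Acworth, Crescent Lake", "Crescent Lake, Ashland")

# Splitting on "," alone leaves a leading space on later items, so
# " Crescent Lake" and "Crescent Lake" count as different strings
with_spaces <- unique(unlist(strsplit(myvec, split = ",")))

# fixed = TRUE treats ", " as a literal string rather than a regex
by_literal <- unique(unlist(strsplit(myvec, split = ", ", fixed = TRUE)))

# Alternative: split on "," and trim each piece afterwards
by_trim <- unique(trimws(unlist(strsplit(myvec, split = ",", fixed = TRUE))))

print(with_spaces)
print(by_literal)
print(by_trim)
```

With the leftover spaces, unique() fails to collapse the duplicate, so trimming (or splitting on ", ") has to happen before deduplication.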

【Solution 3】:

As an alternative, you can also use scan, like this:

unique(scan(what = "", text = v, sep = ",", strip.white = TRUE))

The strip.white = TRUE part takes care of any leading or trailing whitespace you may have.

Note: "v" comes from this other answer.

【Comments】:
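
For a self-contained run of the scan approach, here is a sketch with a small made-up `v`. The `quiet = TRUE` and `quote = ""` arguments are additions not in the answer: the first suppresses the "Read N items" message, and the second is an assumption that the place names should never be interpreted as quoted fields (for instance if a name contains an apostrophe):

```r
# Hypothetical input with inconsistent spacing and one repeated name
v <- c("Acworth , Crescent Lake", "Crescent Lake,Ashland")

towns <- unique(scan(
  what = "",           # read fields as character strings
  text = v,            # each element of v is treated as a line of input
  sep = ",",           # fields are separated by commas
  strip.white = TRUE,  # drop leading/trailing whitespace in each field
  quote = "",          # assumption: disable quote handling entirely
  quiet = TRUE         # do not print the "Read N items" message
))

print(towns)
```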