Extracting unique partial elements from a vector

【Question title】: Extracting unique partial elements from vector
【Posted】: 2014-09-10 21:58:59
【Question description】:

I need to get a list of the unique subject IDs (the part after the / and before the _) from the contents of the following folder.

[1] "."                      "./4101_0"               "./4101_0/4101 Baseline"
[4] "./4101_1"               "./4101_2"               "./4101_2_2"            
[7] "./4101_3"               "./4101_4"               "./4101_5"              
[10] "./4101_6"

Right now I am doing it like this (using the packages stringr and foreach).

# Load the two packages used below
library(stringr)
library(foreach)
# Create list of contents
Folder.list <- list.dirs()
# Split entries by the "/"
SubIDs <- str_split(Folder.list, "/")
# For each entry in the list, retrieve the second element
SubIDs <- unlist(foreach(i=1:length(SubIDs)) %do% SubIDs[[i]][2])
# Split entries by the "_"
SubIDs <- str_split(SubIDs, "_")
# Take the first element after splitting, unlist it, find the unique entries, remove the NA and coerce to numeric
SubIDs <- as.numeric(na.omit(unique(unlist(foreach(i=1:length(SubIDs)) %do% SubIDs[[i]][1]))))

This gets the job done, but it seems unnecessarily awful. What is a cleaner way of getting from point A to point B?

【Question comments】:

Tags: r

【Solution 1】:

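stringr also has the str_extract function, which can be used to extract substrings that match a regular expression pattern. With a positive lookbehind for the / and a positive lookahead for the _, you can achieve your goal.

Starting from @Andrie's x: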

str_extract(x, perl('(?<=/)\\d+(?=_)'))

# [1] NA     "4101" "4101" "4101" "4101" "4101" "4101" "4101" "4101" "4101"

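The pattern above matches one or more digits (i.e. \\d+) that are preceded by a forward slash and followed by an underscore. Lookarounds require wrapping the pattern in perl().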
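
To get from that result to the unique numeric IDs the question asks for, the same lookaround pattern can also be run in base R. The following is a minimal sketch and not part of the original answer: it assumes the x vector defined in Solution 2, and subject_ids is a name chosen here purely for illustration. regexpr() with perl = TRUE supports lookarounds, and regmatches() returns only the elements that matched, so the "." entry drops out without any NA handling.

```r
# Folder listing from the question (same vector as x in Solution 2)
x <- c(".", "./4101_0", "./4101_0/4101 Baseline", "./4101_1", "./4101_2",
       "./4101_2_2", "./4101_3", "./4101_4", "./4101_5", "./4101_6")

# Locate the first run of digits preceded by "/" and followed by "_";
# perl = TRUE enables lookbehind and lookahead in base R
m <- regexpr("(?<=/)\\d+(?=_)", x, perl = TRUE)

# regmatches() keeps matched substrings only, so non-matching entries vanish
# subject_ids is a hypothetical name, not taken from the original post
subject_ids <- unique(as.numeric(regmatches(x, m)))

print(subject_ids)
# [1] 4101
```

As far as I know, stringr 1.0.0 and later use the ICU regex engine, which handles lookarounds without the perl() wrapper, so with a recent stringr the pattern string can probably be passed to str_extract() directly.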

【Comments】:

【Solution 2】:

Use a regular expression.

x <- c(".", "./4101_0", "./4101_0/4101 Baseline", "./4101_1", "./4101_2", "./4101_2_2", "./4101_3", "./4101_4", "./4101_5", "./4101_6")

One way to use a regular expression is to extract the subject code with gsub():

gsub(".*/(\\d+)_.*", "\\1", x)
[1] "."    "4101" "4101" "4101" "4101" "4101" "4101" "4101" "4101" "4101"

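【Comments】:

@Krysta: Note, however, that if the pattern is not found in a particular element, the original string of that element is returned (as happens with ".").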
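
One way to guard against that caveat, offered here as a sketch under assumptions and not as part of the original answer, is to keep only the elements that match the pattern before substituting. The names pattern, matched and subject_ids are hypothetical and chosen for illustration.

```r
x <- c(".", "./4101_0", "./4101_0/4101 Baseline", "./4101_1", "./4101_2",
       "./4101_2_2", "./4101_3", "./4101_4", "./4101_5", "./4101_6")

# Same pattern as in the gsub() call above
pattern <- ".*/(\\d+)_.*"

# TRUE for elements that contain "/<digits>_", FALSE for "." and the like
matched <- grepl(pattern, x)

# Substitute only in matching elements, then reduce to unique numeric IDs
# sub() is enough here because the pattern consumes the whole string
subject_ids <- unique(as.numeric(sub(pattern, "\\1", x[matched])))

print(subject_ids)
# [1] 4101
```

Filtering first means a non-matching entry such as "." never reaches as.numeric(), which would otherwise turn it into NA with a coercion warning.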