Create a new column of concatenated substrings: answers

【Question title】: Create new column of concatenated substrings
【Posted】: 2020-10-30 01:11:51
【Question description】:

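Apologies for the newbie question!

I would like to use mutate together with other dplyr/stringr functions to create a new column: extract substrings from the "File" column and combine them into an "Image" column, as shown in the output below: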

test<- data.frame(File= c("4301 TMA_Scan1_Core[1,2,A]_[10673,40057]_component_data.tif", "TA3150Scan1_Core[1,3,A][7006,42110]_component_data.tif"))

testoutput<- data.frame(File= c("4301 TMA_Scan1_Core[1,2,A]_[10673,40057]_component_data.tif", "TA3150Scan1_Core[1,3,A][7006,42110]_component_data.tif"),
                        Image = c("TA4301-2A", "TA3150-3A"))

Many thanks!

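【Comments】:

Can you explain the logic for extracting TA4301-1A and TA3150-1A from the File column? For the first row, where do "TA4301" and "1-A" come from?
TA4301-1A etc. are identifiers for each row that are compatible with downstream analysis in MATLAB. The dataset has >1e6 rows.
The first row of the shared example does not contain TA4301-1A.
The first-row example needs TA added; the second-row example does not. 1A is taken from the last two characters of [1,1,A].
This is similar to ekoam's answer below, using a slightly different regular expression: test$Image <- sub('(?:[A-Z]+)?(\\d+).*?\\[\\d+,(\\d+),([A-Z])\\].*', 'TA\\1-\\2\\3', test$File, perl = TRUE)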

Tags: r dplyr stringr

【Solution 1】:

Is this what you want?

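library(dplyr)  # provides %>% and mutate()
# Capture the leading digits and the last two bracketed values, then rebuild the ID
test %>% 
  mutate(Image = sub("^\\D*(\\d+)[^][,]+\\[\\w+,(\\w+),(\\w+)\\].+", "TA\\1-\\2\\3", File))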

Output

                                                         File     Image
1 4301 TMA_Scan1_Core[1,2,A]_[10673,40057]_component_data.tif TA4301-2A
2      TA3150Scan1_Core[1,3,A][7006,42110]_component_data.tif TA3150-3A
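
One thing that may matter on a dataset with more than 1e6 rows: sub() returns its input unchanged when the pattern does not match, so an unexpected file name would pass through silently as the "Image" value. The following is a minimal base R sketch of one way to flag such rows. It assumes NA is an acceptable marker for non-matching names, and the third file name is a made-up example that does not come from the question.

```r
# File names from the question plus one hypothetical non-matching name
files <- c(
  "4301 TMA_Scan1_Core[1,2,A]_[10673,40057]_component_data.tif",
  "TA3150Scan1_Core[1,3,A][7006,42110]_component_data.tif",
  "notes.txt"  # hypothetical, not from the question
)

# Same pattern as in the answer above
pattern <- "^\\D*(\\d+)[^][,]+\\[\\w+,(\\w+),(\\w+)\\].+"

# TRUE where the pattern matches the file name
matched <- grepl(pattern, files)

# Build the identifier only for matching rows; mark the rest as NA
image <- ifelse(matched, sub(pattern, "TA\\1-\\2\\3", files), NA_character_)
print(image)
```

Inside mutate(), the same idea could be written with File in place of files.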

Reading the regular expression from left to right:

1. Match zero or more non-digit characters from the beginning
2. Match one or more digits; set it as the first capturing group
3. Match one or more characters that are not "]", "[", or ","
4. Match the three values inside square brackets; set the last two as second and third capturing groups
5. Match remaining characters

^\\D*  (\\d+  )  [^][,]+       \\[\\w+,(\\w+),(\\w+)\\]   .+
   TA     3150   Scan1_Core      [  1 ,   3  ,   A    ]   [7006,42110]_component_data.tif
          4301   TMA_Scan1_Core  [  1 ,   2  ,   A    ]   _[10673,40057]_component_data.tif
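
To see the capturing groups described above directly, the pattern can be run through base R's regexec() and regmatches(), which return the full match followed by each group. This is only an illustrative sketch: the variable names (files, pattern, groups, parts, image) are chosen here for the example and are not part of the answer.

```r
# File names from the question
files <- c(
  "4301 TMA_Scan1_Core[1,2,A]_[10673,40057]_component_data.tif",
  "TA3150Scan1_Core[1,3,A][7006,42110]_component_data.tif"
)

# Same pattern as in the answer
pattern <- "^\\D*(\\d+)[^][,]+\\[\\w+,(\\w+),(\\w+)\\].+"

# Each list element holds the full match, then capturing groups 1 to 3
groups <- regmatches(files, regexec(pattern, files))

# Drop the full match and arrange the groups as one row per file
parts <- t(sapply(groups, function(m) m[-1]))
print(parts)

# Assemble the identifier the same way the replacement "TA\\1-\\2\\3" does
image <- paste0("TA", parts[, 1], "-", parts[, 2], parts[, 3])
print(image)
```

For the two sample rows, image should hold the same values as the Image column in the output above.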

【Comments】: