Parse by nth delimiter in R: answers

【Question title】: Parse by nth delimiter in R
【Posted】: 2019-04-22 17:00:52
【Question description】:

I have a data frame like the following:

Col1    Col2
   A    5!5!!6!!3!!m
   B    7_8!!6!!7!!t

structure(list(Col1 = c("A", "B"), Col2 = c("5!5!!6!!3!!m", "7_8!!6!!7!!t" )), class = "data.frame", row.names = c(NA, -2L))

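How can I create a new column that extracts the third parse of the string found in Col2, that is, the third piece when the string is split on '!!'?

In SQL, I use the SPLIT_PART function: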

SPLIT_PART(Col2, '!!', 3)

I am looking for an equivalent function in R.

Expected output:

Col1            Col2    Col3
   A    5!5!!6!!3!!m       3
   B    7_8!!6!!7!!t       7

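【Question discussion】:

Tags: r strsplit

【Solution 1】:

Here is a tidyverse option, although at its core it is functionally the same as Rushabh's data.table-based answer.

When given the simplify = TRUE argument, stringr::str_split returns a character matrix with one piece per column. You can index the column you want to get the piece at that position: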

library(tidyverse)

df1 %>%
    mutate(Col3 = str_split(Col2, pattern = '!!', simplify = TRUE)[, 3])

  Col1         Col2 Col3
1    A 5!5!!6!!3!!m    3
2    B 7_8!!6!!7!!t    7

df1 %>%
    mutate(Col3 = str_split(Col2, pattern = '!!', simplify = TRUE)[, 2])

  Col1         Col2 Col3
1    A 5!5!!6!!3!!m    6
2    B 7_8!!6!!7!!t    6

df1 %>%
    mutate(Col3 = str_split(Col2, pattern = '!!', simplify = TRUE)[, 1])

  Col1         Col2 Col3
1    A 5!5!!6!!3!!m  5!5
2    B 7_8!!6!!7!!t  7_8
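
One detail of the matrix approach is how rows with fewer pieces behave. The sketch below assumes stringr 1.5.0 or later, where str_split_i is available, and adds a hypothetical third string "1!!2" that is not in the question's data. It compares indexing the simplify = TRUE matrix with asking for the piece directly:

```r
library(stringr)

# The first two strings are from the question; "1!!2" is a made-up short row.
col2 <- c("5!5!!6!!3!!m", "7_8!!6!!7!!t", "1!!2")

# fixed() makes it explicit that "!!" is a literal delimiter, not a regex.
m <- str_split(col2, fixed("!!"), simplify = TRUE)

# Rows with fewer pieces are padded with "" in the matrix.
third_from_matrix <- m[, 3]

# str_split_i (stringr >= 1.5.0) takes the position as an argument
# and returns NA when the piece does not exist.
third_direct <- str_split_i(col2, fixed("!!"), 3)
```

Which of the two is preferable depends on whether a missing piece should show up as an empty string or as NA in the new column.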

【Discussion】:

【Solution 2】:

You can use str_split from the stringr package:

> library(stringr)
> library(data.table)
> # dt is assumed to hold the question's data (called df1 in the other answers)
> setDT(dt)[, Col3 := sapply(Col2, function(x) unlist(str_split(x, "!!"))[3])]

Output:

> dt
   Col1         Col2 Col3
1:    A 5!5!!6!!3!!m    3
2:    B 7_8!!6!!7!!t    7

Note: you can change the position inside the function from 3 to any n.
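
To make the position a parameter rather than a hard-coded 3, one possible variant uses data.table's own tstrsplit with its keep argument instead of sapply. This is a swapped-in technique, not the code from the answer above, and the variable name n is only illustrative:

```r
library(data.table)

# Rebuild the question's data as a data.table so the block is self-contained.
dt <- data.table(
  Col1 = c("A", "B"),
  Col2 = c("5!5!!6!!3!!m", "7_8!!6!!7!!t")
)

n <- 3L  # position of the piece to keep; change this to parse a different position

# tstrsplit splits every string and transposes the result;
# keep = n returns only the nth piece, as a list of length one.
dt[, Col3 := tstrsplit(Col2, "!!", fixed = TRUE, keep = n)[[1L]]]
```

With n <- 3L this should reproduce the Col3 values shown above. The new column is character, so wrap the right-hand side in as.numeric() if numbers are needed.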

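【Discussion】:

【Solution 3】:

We can use str_extract to pull out the number that sits just before the final '!!' and the trailing letters:

library(stringr)
library(dplyr)  # provides %>% and mutate
df1 %>%
  mutate(Col3 = as.numeric(str_extract(Col2, "\\d+(?=!![a-z]+$)")))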
#  Col1         Col2 Col3
#1    A 5!5!!6!!3!!m    3
#2    B 7_8!!6!!7!!t    7

If we need it by position, then:

df1$Col3 <- as.numeric(sapply(strsplit(df1$Col2, "!!", fixed = TRUE), `[`, 3))
df1$Col3
#[1] 3 7
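
The positional approach can be wrapped in a small helper that takes the position as an argument, which is roughly what SPLIT_PART does in SQL. The function name split_part below is hypothetical (it is not a base R function), and this is only a sketch using base R:

```r
# Hypothetical helper: return the nth piece of each string in x after
# splitting on a literal delimiter. Pieces that do not exist come back as NA.
split_part <- function(x, delimiter, n) {
  pieces <- strsplit(x, delimiter, fixed = TRUE)
  # Indexing past the end of a character vector yields NA_character_.
  vapply(pieces, function(p) p[n], character(1))
}

col2 <- c("5!5!!6!!3!!m", "7_8!!6!!7!!t")

third <- split_part(col2, "!!", 3)
# [1] "3" "7"
first <- split_part(col2, "!!", 1)
# [1] "5!5" "7_8"
fifth <- split_part(col2, "!!", 5)
# [1] NA NA
```

The result is a character vector; as in the line above, as.numeric() can be applied when the piece is known to be a number.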

Or use gsubfn to insert a position marker at the third delimiter, and then extract the number in front of it:

library(gsubfn)
p <- proto(fun = function(this, x)  if(count == 3) paste0(";", x))
as.numeric(str_extract(gsubfn("(!!)", p, df1$Col2), "\\d+(?=;)"))
#[1] 3 7

Data

df1 <- structure(list(Col1 = c("A", "B"), Col2 = c("5!5!!6!!3!!m", "7_8!!6!!7!!t"
 )), class = "data.frame", row.names = c(NA, -2L))

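【Discussion】:

Thanks, where do you specify the third parse?
The only reason I ask is that I have to parse different strings by the nth delimiter many times. I am trying to find a simple way to do that.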
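
When several positions are needed from the same column, one option is to split once and keep every piece as its own column, so that any position can be read off afterwards. The following base R sketch assumes the question's data; the column names part1, part2, and so on are made up for illustration:

```r
df1 <- data.frame(
  Col1 = c("A", "B"),
  Col2 = c("5!5!!6!!3!!m", "7_8!!6!!7!!t"),
  stringsAsFactors = FALSE
)

# Split every string once on the literal delimiter.
pieces <- strsplit(df1$Col2, "!!", fixed = TRUE)

# Pad shorter results with NA so every row has the same number of pieces.
width <- max(lengths(pieces))
padded <- lapply(pieces, `length<-`, width)

# One row per input string, one column per position.
parts <- do.call(rbind, padded)
colnames(parts) <- paste0("part", seq_len(width))

result <- cbind(df1, parts, stringsAsFactors = FALSE)
```

Here result$part3 holds the third piece of each string, and the other positions are available in the same way without splitting again.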