How to turn str_extract_all into multiple columns: answers

【Question title】: How to turn str_extract_all into multiple columns
【Posted】: 2018-08-08 21:25:28
【Question】:

Here is the text:

  data$charge[1]
  [1] "Count #1 as Filed: In Violation of; 21 O.S. 645; Count #2 as Filed: In Violation of; 21 O.S. 1541.1;Docket 1"

I am currently trying to extract statutes from legal data. My code looks like this:

str_extract_all(data$charge[1:3], "(?<=Violation of;)(\\D|\\d){4,20}(?=;Count |;Docket)") 

[[1]]
[1] "21 O.S. 645"      "21 O.S. 1541.1"

[[2]]
[1] "21 O.S. 1435"       "21 O.S. 1760(A)(1)"

[[3]]
[1]   "21 O.S. 1592"

I would like to add them as columns to a data frame like this:

id           name           statute1           statute2           statute3
1           BLACK, JOHN     21 O.S. 645        21 O.S. 1541.1     NA
2           DOE, JANE       21 O.S. 1435       21 O.S. 1760(A)(1) NA
3           ROSS, BOB       21 O.S. 1592       NA                 NA

Thanks! Does this make sense?

【Question comments】:

I think we could use a reproducible example.
Do you mean the text I am extracting from?
Yes, if we cannot recreate it, we cannot solve your problem. Read How to make a great R reproducible example

Tags: r stringr

【Solution 1】:

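Since you did not include a reproducible example with data or expected output, I cannot be sure, but I think what you are looking for is the simplify = TRUE argument of str_extract_all.

From the examples in ?str_extract_all: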

shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")

# without simplify = TRUE
str_extract_all(shopping_list, "\\b[a-z]+\\b")
[[1]]
[1] "apples"

[[2]]
[1] "bag"   "of"    "flour"

[[3]]
[1] "bag"   "of"    "sugar"

[[4]]
[1] "milk"

# with simplify = TRUE
str_extract_all(shopping_list, "\\b[a-z]+\\b", simplify = TRUE)
     [,1]     [,2] [,3]   
[1,] "apples" ""   ""     
[2,] "bag"    "of" "flour"
[3,] "bag"    "of" "sugar"
[4,] "milk"   ""   ""

Using the example you added:

dat <- "Count #1 as Filed: In Violation of; 21 O.S. 645; Count #2 as Filed: In Violation of; 21 O.S. 1541.1;Docket 1"

str_extract_all(dat, "(?<=Violation of;)(\\D|\\d){4,20}(?=;Count |;Docket)",
                simplify = TRUE)

     [,1]             
[1,] " 21 O.S. 1541.1"
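
The matrix returned by simplify = TRUE can be turned into data frame columns with a few base R steps. The following is only a minimal sketch, not part of the original answer: the two-element charge vector is cut down from the question's text, the pattern is an assumption chosen so that every statute on a line matches, and the statuteN column names are illustrative.

```r
library(stringr)

# Assumed sample input, shortened from the question's text
charge <- c(
  "Count #1 as Filed: In Violation of; 21 O.S. 645; Count #2 as Filed: In Violation of; 21 O.S. 1541.1;Docket 1",
  "Count #2 as Filed: In Violation of; 21 O.S. 1592"
)

# Assumed pattern: everything after "Violation of; " up to the next ";" or the end
m <- str_extract_all(charge, "(?<=Violation of;\\s).+?(?=;|$)", simplify = TRUE)

# simplify = TRUE pads shorter rows with "", so turn the padding into NA
m[m == ""] <- NA

# Name the columns statute1, statute2, ... and convert to a data frame
colnames(m) <- paste0("statute", seq_len(ncol(m)))
statutes <- as.data.frame(m, stringsAsFactors = FALSE)
```

The resulting statutes data frame has one row per input string, so it could be attached to the original data with cbind(), assuming the row order has not changed.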

【Comments】:

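Actually this works! Thanks! Now, do you know how to convert this output into data frame columns?
Are you sure there is no typo and that you are using str_extract_all? That error occurs when you use an argument the function does not recognize, which is usually caused by a typo or a misplaced parenthesis that attaches the argument to a different function than the one you intended.
No, you were right, simplify=TRUE works! I just need to convert the output into data frame columns.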

【Solution 2】:

This is far from the most efficient solution, but unlike the other solutions, it is one I can understand:

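library(dplyr)    # mutate(), %>%, and the re-exported tribble()
library(stringr)  # str_extract_all()

df <- tribble(
  ~foo,
  "1,2",
  "3,4"
)

# each new column re-runs the extraction and keeps one column of the matrix
df %>% mutate(
  col1 = str_extract_all(foo, "\\d+", simplify = TRUE)[,1],
  col2 = str_extract_all(foo, "\\d+", simplify = TRUE)[,2],
)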

# A tibble: 2 x 3
  foo   col1  col2 
  <chr> <chr> <chr>
1 1,2   1     2    
2 3,4   3     4
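
The repeated str_extract_all call is what makes this approach less efficient. One possible variant, shown here only as a sketch under the same toy data, runs the extraction once and binds the resulting matrix to the data frame. The names parts and result are hypothetical, and the colN naming simply mirrors the answer above.

```r
library(dplyr)
library(stringr)
library(tibble)

df <- tribble(
  ~foo,
  "1,2",
  "3,4"
)

# Run the extraction once; the result is a character matrix
parts <- str_extract_all(df$foo, "\\d+", simplify = TRUE)

# Name the matrix columns col1, col2, ... so they become column names
colnames(parts) <- paste0("col", seq_len(ncol(parts)))

# Bind the extracted columns next to the original data
result <- bind_cols(df, as_tibble(parts))
```

This also adapts to however many matches the longest row has, instead of hard-coding two columns.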

【Comments】:

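【Solution 3】:

You can do this with the tidyverse packages. The regex pattern in your example does not work for some of the sample text provided, because it always requires a trailing semicolon. The pattern used below should be simpler, but may need some tweaking depending on the actual text.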

library(tidyverse)

df %>% 
  mutate(charges = str_extract_all(charge, "(?<=Violation of;\\s).+?(?=(;|$))")) %>% # extracts the different charges
  select(-charge) %>%  # dropping the raw text can be skipped
  unnest(charges) %>%  # separates the different charges for each name
  group_by(name) %>%   # in this sample there is only a name, but hopefully the real data has some sort of unique id - there could be lots of Jane Doe's in this data
  mutate(statute = paste0('statute', row_number())) %>% # adds a statute number to each charge
  spread(statute, charges) # shift the data from long to wide

# A tibble: 3 x 3
# Groups:   name [3]
  name       statute1     statute2          
  <chr>      <chr>        <chr>             
1 BLACK,JOHN 21 O.S. 645  21 O.S. 1541.1    
2 DOE, JANE  21 O.S. 1435 21 O.S. 1760(A)(1)
3 ROSS, BOB  21 O.S. 1592 NA
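
The remark about the trailing semicolon can be checked directly. The sketch below compares the question's pattern with the one from this answer on the third sample charge, which ends without a semicolon. The variable names are hypothetical and only serve the illustration.

```r
library(stringr)

# A charge string with no trailing ";Count " or ";Docket"
x <- "Count #2 as Filed: In Violation of; 21 O.S. 1592"

# Pattern from the question: requires ";Count " or ";Docket" after the statute
original <- "(?<=Violation of;)(\\D|\\d){4,20}(?=;Count |;Docket)"

# Pattern from this answer: stops at the next ";" or at the end of the string
revised <- "(?<=Violation of;\\s).+?(?=(;|$))"

orig_hits <- str_extract_all(x, original)[[1]]  # no match: the lookahead fails
rev_hits  <- str_extract_all(x, revised)[[1]]   # one match: "21 O.S. 1592"
```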

Sample data:

df <- tibble(name = c('BLACK,JOHN', 'DOE, JANE', 'ROSS, BOB'), 
                 charge = c('Count #1 as Filed: In Violation of; 21 O.S. 645; Count #2 as Filed: In Violation of; 21 O.S. 1541.1;Docket 1',
                            'Count #3 as Filed: In Violation of; 21 O.S. 1435; Count #4 as Filed: In Violation of; 21 O.S. 1760(A)(1)',
                            'Count #2 as Filed: In Violation of; 21 O.S. 1592'))
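
In newer versions of tidyr, spread() is superseded by pivot_wider(). The following sketch is a swapped-in variant of the pipeline above rather than the answerer's own code. It assumes tidyr 1.0.0 or later, and wide is a hypothetical name for the result.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

df <- tibble(
  name = c("BLACK,JOHN", "DOE, JANE", "ROSS, BOB"),
  charge = c(
    "Count #1 as Filed: In Violation of; 21 O.S. 645; Count #2 as Filed: In Violation of; 21 O.S. 1541.1;Docket 1",
    "Count #3 as Filed: In Violation of; 21 O.S. 1435; Count #4 as Filed: In Violation of; 21 O.S. 1760(A)(1)",
    "Count #2 as Filed: In Violation of; 21 O.S. 1592"
  )
)

wide <- df %>%
  # list-column holding every statute found in each charge string
  mutate(charges = str_extract_all(charge, "(?<=Violation of;\\s).+?(?=(;|$))")) %>%
  select(-charge) %>%
  unnest(charges) %>%                  # one row per statute
  group_by(name) %>%
  mutate(statute = paste0("statute", row_number())) %>%
  ungroup() %>%
  # long to wide; names with fewer statutes get NA in the extra columns
  pivot_wider(names_from = statute, values_from = charges)
```

As in the original pipeline, grouping by name assumes names are unique. With real data, a proper id column would be a safer grouping key.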

【Comments】: