Split texts by tags and set column names: answers

【Question title】: Split texts by tags and set column names
【Posted】: 2021-09-03 22:15:11
【Question】:

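I have a text column whose values are written in a tag style. I want to split this text into columns, where each tag becomes a column name holding the corresponding value.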

text = "{\"article_id\":-41,\"word-count\":379,\"article_date\":05012017,\"source\":\"news::abc\",\"author\":\"Peter K\",\"title\":\"The rise of AI\",\"topics\":{\"Business\":10, \"Computer\":5},\"topics-group\":[{\"primary\":\"Business\",\"secondary\":\"Computer\"}]}"

Expected output:

data = data.frame("article_id" = -41, "word-count" = 379, "article_date" = 05012017,
                  "source"= "news::abc", "author" = "Peter K", "title" = "The rise of AI",
                  "topics" = "{\"Business\":10, \"Computer\":5}", 
                  "topics-group" = "[{\"primary\":\"Business\",\"secondary\":\"Computer\"}]")

I tried strsplit:

test = strsplit(as.character(text), ",\\\"")
test
[[1]]
[1] "{\"article_id\":-41"                        "word-count\":379"                          
[3] "article_date\":05012017"                    "source\":\"news::abc\""                    
[5] "author\":\"Peter K\""                       "title\":\"The rise of AI\""                
[7] "topics\":{\"Business\":10, \"Computer\":5}" "topics-group\":[{\"primary\":\"Business\"" 
[9] "secondary\":\"Computer\"}]}"

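But tags such as topics-group are a problem: the value contains a comma followed by a quote, so it gets split into two pieces.

My planned workflow was to finish this split and then split each element again to separate the tag from its value. But I think there must be a better way to split the text and set the tag names as column names.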

【Question comments】:

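This is slightly broken JSON, and it would make sense to fix it on the provider side. Is the structure always the same?
Have a look at the rjson library, but as @Wiktor commented, the article_date value 05012017 has a leading zero (it reads like an octal literal), which JSON does not support. Put that value in double quotes so your JSON passes validation.
Now that I see this is JSON-style data, I will look further into rjson.
Try text <- gsub('("article_date":)(\\d+)', '\\1"\\2"', text), then use library(jsonlite) and document <- fromJSON(txt=text).
After parsing the JSON, you can reformat the date field "manually".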
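
As a rough illustration of the "manual" date reformatting mentioned in the last comment, here is a minimal base R sketch. It assumes the eight digits are laid out as month, day, year (MMDDYYYY), which matches the mdy() call used in Solution 2 but is not stated by the asker. The variable names are made up for the example.

```r
# Assumption: "05012017" means month 05, day 01, year 2017 (MMDDYYYY).
raw_date <- "05012017"

# Parse the quoted string straight into a Date.
parsed_date <- as.Date(raw_date, format = "%m%d%Y")

# If the value arrived as a number, the leading zero is gone;
# sprintf() can pad it back to eight digits before parsing.
numeric_date <- 5012017L
padded <- sprintf("%08d", numeric_date)
parsed_from_number <- as.Date(padded, format = "%m%d%Y")

print(parsed_date)
print(parsed_from_number)
```

Keeping the value as a quoted string until it is parsed avoids losing the leading zero in the first place.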

Tags: r regex split

【Solution 1】:

You can add double quotes around the value of the article_date field and then parse the JSON string with jsonlite:

text <- gsub('("article_date":)(\\d+)', '\\1"\\2"', text)

library(jsonlite)
document <- fromJSON(txt=text)
as.data.frame(document)
#   article_id word.count article_date    source  author          title topics.Business topics.Computer topics.group.primary topics.group.secondary
# 1        -41        379     05012017 news::abc Peter K The rise of AI              10               5             Business               Computer

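Regex details:

("article_date":) - Group 1: the literal string "article_date":
(\d+) - Group 2: one or more digits.

The replacement is \1"\2": the Group 1 value, followed by the Group 2 value wrapped in double quotes.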
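
To see the replacement in isolation, here is a small base R sketch applied to a shortened input. The `broken` string is a cut-down stand-in for the asker's text, not the full record.

```r
# Shortened, hypothetical stand-in for the original text:
# article_date is an unquoted number with a leading zero.
broken <- '{"article_id":-41,"article_date":05012017,"source":"news::abc"}'

# Group 1 keeps the key, group 2 captures the digits;
# the replacement re-emits the digits inside double quotes.
fixed <- gsub('("article_date":)(\\d+)', '\\1"\\2"', broken)

print(fixed)
# {"article_id":-41,"article_date":"05012017","source":"news::abc"}
```

Only the digits after "article_date": are touched; other numeric fields such as article_id are left as they were.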

【Comments】:

【Solution 2】:

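We can do this with the tidyverse:

Use str_replace_all to pass the digits (\\d+) that follow "article_date": through as.integer, which drops the leading zero padding so the value becomes a valid JSON number
Use fromJSON to convert the JSON into an R object
Flatten the nested list and data.frame elements with invoke(c, .)
Use as_tibble to convert the list to a tibble
Finally, use mdy from lubridate to convert 'article_date' to the Date class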

library(dplyr)
library(stringr)
library(jsonlite)
library(lubridate)
library(purrr)
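text %>%
     # drop the leading zero so article_date becomes a valid JSON number
     str_replace_all('(?<=article_date":)(\\d+)',  as.integer) %>%
     fromJSON %>% 
     # splice the nested elements into one flat list
     invoke(c, .) %>%
     as_tibble %>% 
     # read the digits as month-day-year
     mutate(article_date = mdy(article_date))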

Output:

# A tibble: 1 x 10
  article_id `word-count` article_date source   author  title       topics.Business topics.Computer `topics-group.prima… `topics-group.second…
       <int>        <int> <date>       <chr>    <chr>   <chr>                 <int>           <int> <chr>                <chr>                
1        -41          379 2017-05-01   news::a… Peter K The rise o…              10               5 Business             Computer
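
The flattening step in the pipeline above relies on how c() treats nested lists. The sketch below shows that behaviour in base R with do.call(c, ...), which should behave like invoke(c, .) here. The `parsed` list is a hand-built, shortened stand-in for what fromJSON returns, not the real parsed object.

```r
# Hypothetical, shortened stand-in for the parsed JSON.
parsed <- list(
  article_id = -41L,
  title = "The rise of AI",
  topics = list(Business = 10L, Computer = 5L)
)

# c() splices a nested list into its parent and joins the
# outer and inner names with a dot.
flat <- do.call(c, parsed)

print(names(flat))
str(flat)
```

This is where column names such as topics.Business in the output come from: the outer name "topics" is prefixed to each inner name.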

【Comments】: