Extracting words from a string in R

【Question title】: Extracting words from a string in R
【Posted】: 2018-04-27 16:50:13
【Question description】:

Each row of my data frame df contains text like the following example:

[{'id': 16, 'name': 'Soccer'}, {'id': 35, 'name': 'Basketball'}, {'id': 10751, 'name': 'Boxing'}]

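Is there a way to extract the words (Soccer, Basketball, Boxing) from this text? Sorry, I'm new to text analysis in R.

【Discussion】:

stringr::str_extract_all(string,"\\w+(?='\\})") should work
It works, but it gives me an answer like c("Soccer", "Basketball", "Boxing"). How can I get it as "Soccer", "Basketball", "Boxing"?
Just do unlist(stringr::str_extract_all(string,"\\w+(?='\\})"))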
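
To make the two comments above concrete, here is a minimal sketch, assuming the stringr package is installed. The pattern \\w+(?='\\}) matches a run of word characters only when it is immediately followed by '}, which is where each name value ends in the sample text. str_extract_all returns a list with one element per input string, so unlist is what flattens it to a plain character vector. The variable names string, matches and words are just illustrative.

```r
library(stringr)

string <- "[{'id': 16, 'name': 'Soccer'}, {'id': 35, 'name': 'Basketball'}, {'id': 10751, 'name': 'Boxing'}]"

# Lookahead (?='\\}) keeps only words directly followed by '}
matches <- str_extract_all(string, "\\w+(?='\\})")

# The result is a list of length 1 (one element per input string)
is.list(matches)

# Flatten to an ordinary character vector
words <- unlist(matches)
words
```

Note that \\w+ stops at spaces, so a multi-word name would be cut down to its last word with this pattern.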

Tags: r text frame word

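【Solution 1】:

It looks like you have a JSON-like input string. After replacing the single quotes with double quotes, you can parse it with jsonlite::fromJSON and extract the relevant column, name: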

# Sample string
ss <- "[{'id': 16, 'name': 'Soccer'}, {'id': 35, 'name': 'Basketball'}, {'id': 10751, 'name': 'Boxing'}]"

# Parse JSON (single quotes are replaced with double quotes first)
library(jsonlite)
df <- fromJSON(txt = gsub("'", "\"", ss))

# Extract words
df$name
#[1] "Soccer"     "Basketball" "Boxing"
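
Since the question mentions that every row of a data frame holds such a string, the same idea can be applied row by row. The following is only a sketch under assumptions: the data frame dat, its column sports, and the helper extract_names are hypothetical names, and the quote replacement would break if a name itself contained an apostrophe.

```r
library(jsonlite)

# Hypothetical data frame; the column name `sports` is an assumption
dat <- data.frame(
  sports = c(
    "[{'id': 16, 'name': 'Soccer'}, {'id': 35, 'name': 'Basketball'}]",
    "[{'id': 10751, 'name': 'Boxing'}]"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fix the quotes, parse, return the `name` column
extract_names <- function(s) {
  parsed <- fromJSON(gsub("'", "\"", s, fixed = TRUE))
  parsed$name
}

# One character vector per row, stored as a list column
dat$names <- lapply(dat$sports, extract_names)

# Optionally collapse each row's names into a single string
dat$names_flat <- vapply(dat$names, paste, character(1), collapse = ", ")
dat$names_flat
```

Keeping the names as a list column preserves them as separate values; the collapsed column is easier to print or export.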

【Discussion】:

【Solution 2】:

Perhaps something like the following.

x <- "[{'id': 16, 'name': 'Soccer'}, {'id': 35, 'name': 'Basketball'}, {'id': 10751, 'name': 'Boxing'}]"
g <- gregexpr("[[:alpha:]]+", x)
y <- unlist(regmatches(x, g))
y[y != "id" & y != "name"]
#[1] "Soccer"     "Basketball" "Boxing"

Another possibility for the last instruction is to use %in%.

y[!y %in% c("id", "name")]
#[1] "Soccer"     "Basketball" "Boxing"

This way you can keep a vector of unwanted strings, such as c("id", "name"), and avoid a long chain of conditions joined with &.
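
One caveat with matching every alphabetic run: a name made of several words would be split into separate pieces. A possible variation, offered as a sketch rather than as part of the answer above, is to anchor the match on the text 'name': ' with a lookbehind, so that everything up to the closing quote is captured. The sample string x2 with the value 'Table Tennis' is made up for illustration.

```r
# Made-up sample with a two-word name
x2 <- "[{'id': 16, 'name': 'Table Tennis'}, {'id': 35, 'name': 'Basketball'}]"

# Original approach: every alphabetic run, minus the unwanted keys
y2 <- unlist(regmatches(x2, gregexpr("[[:alpha:]]+", x2)))
split_words <- y2[!y2 %in% c("id", "name")]
split_words
#[1] "Table"      "Tennis"     "Basketball"

# Variation: take everything after 'name': ' up to the next single quote
m2 <- gregexpr("(?<='name': ')[^']+", x2, perl = TRUE)
full_names <- unlist(regmatches(x2, m2))
full_names
#[1] "Table Tennis" "Basketball"
```

The lookbehind relies on the exact spacing 'name': ' seen in the sample text, so it would need adjusting if the real data is formatted differently.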

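【Discussion】:

Just do regmatches(x,gregexpr("\\w+(?='\\})",x,perl = T))
It works, but it gives me an answer like c("Soccer", "Basketball", "Boxing"). How can I get it as "Soccer", "Basketball", "Boxing"?
@user36729 In R, the vector "Soccer", "Basketball", "Boxing" is formed with the combine function c(). That is, c("Soccer", "Basketball", "Boxing") is a way of creating the vector, not the vector itself. When you say it gives the answer c(etc), can you explain that better?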
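
The thread does not say where the c(...) text showed up, so the following is a guess rather than a diagnosis. regmatches returns a list, and when a list element holding a character vector is coerced to a string (for example with as.character), R produces the literal text c("Soccer", "Basketball", "Boxing"). The sketch below shows that behaviour and two ways to get a single display string from the flattened vector. The variable names are illustrative.

```r
x <- "[{'id': 16, 'name': 'Soccer'}, {'id': 35, 'name': 'Basketball'}, {'id': 10751, 'name': 'Boxing'}]"

# regmatches() returns a list with one element per input string
res <- regmatches(x, gregexpr("\\w+(?='\\})", x, perl = TRUE))
is.list(res)

# Coercing the list to character yields the deparsed text c("Soccer", ...)
as_text <- as.character(res)

# Flatten first to get an ordinary character vector
words <- unlist(res)

# Single comma-separated string without quotes
plain <- toString(words)
plain
#[1] "Soccer, Basketball, Boxing"

# Single string with each word wrapped in double quotes
quoted <- paste0('"', words, '"', collapse = ", ")
cat(quoted, "\n")
```

If the goal is simply a vector to work with, words is already that; the last two steps matter only when one formatted string is wanted.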