Splitting and extracting strings in R: answers

【Question title】: Splitting and extracting strings in R
【Posted】: 2015-05-07 09:40:56
【Question description】:

Rules

{Denny Frying Pan} => {Denny C-Size Batteries}
{Denny Scented Tissue} => {Denny Paper Plates}
{Blue Label Fancy Canned Clams} => {Blue Label Canned Tuna in Water}
{Denny Plastic Forks} => {Golden Frozen Peas}
{Denny Frying Pan} => {Denny D-Size Batteries}
{Denny Plastic Forks} => {Faux Products Apricot Shampoo}
{Golden Frozen Peas} => {Denny Plastic Forks}
{Faux Products Apricot Shampoo} => {Denny Plastic Forks}
{Blue Label Canned Tuna in Water} => {Blue Label Fancy Canned Clams}
{Blue Label Canned String Beans} => {Faux Products Buffered Aspirin}
{Denny D-Size Batteries} => {Denny Frying Pan}

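I have a single-column data frame like the one above. I want to split each rule into an LHS and an RHS.

The LHS should contain the characters between the {} that comes before the =>. Likewise, the RHS should contain the characters between the {} that comes after the =>.

How can I do this in R?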

【Question discussion】:

Tags: r string substring extract substr

【Solution 1】:

RULES <- c("{Denny Frying Pan} => {Denny C-Size Batteries}",
           "{Denny Scented Tissue} => {Denny Paper Plates}",
           "{Blue Label Fancy Canned Clams} => {Blue Label Canned Tuna in Water}",
           "{Denny Plastic Forks} => {Golden Frozen Peas}",
           "{Denny Frying Pan} => {Denny D-Size Batteries}",
           "{Denny Plastic Forks} => {Faux Products Apricot Shampoo}",
           "{Golden Frozen Peas} => {Denny Plastic Forks}",
           "{Faux Products Apricot Shampoo} => {Denny Plastic Forks}",
           "{Blue Label Canned Tuna in Water} => {Blue Label Fancy Canned Clams}",
           "{Blue Label Canned String Beans} => {Faux Products Buffered Aspirin}",
           "{Denny D-Size Batteries} => {Denny Frying Pan}")

df <- as.data.frame(do.call(rbind,strsplit(RULES,"} => {",fixed=TRUE)))
df[,1] <- gsub("{","",df[,1],fixed = TRUE)
df[,2] <- gsub("}","",df[,2],fixed = TRUE)

df
                                V1                              V2
1                 Denny Frying Pan          Denny C-Size Batteries
2             Denny Scented Tissue              Denny Paper Plates
3    Blue Label Fancy Canned Clams Blue Label Canned Tuna in Water
4              Denny Plastic Forks              Golden Frozen Peas
5                 Denny Frying Pan          Denny D-Size Batteries
6              Denny Plastic Forks   Faux Products Apricot Shampoo
7               Golden Frozen Peas             Denny Plastic Forks
8    Faux Products Apricot Shampoo             Denny Plastic Forks
9  Blue Label Canned Tuna in Water   Blue Label Fancy Canned Clams
10  Blue Label Canned String Beans  Faux Products Buffered Aspirin
11          Denny D-Size Batteries                Denny Frying Pan

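【Discussion】:

df <- as.data.frame(do.call(rbind,strsplit(RULES,"} => {",fixed=TRUE))) gives: Error in strsplit(RULES, "} => {", fixed = TRUE) : non-character argument
@Nimish Jain That is probably because RULES is a factor. Try RULES <- as.character(RULES)
df[,2] <- gsub("}","",df[,2],fixed = TRUE) gives: Error in `[.data.frame`(df, , 2) : undefined columns selected
Does your vector RULES have "RULE" as its first row? I assumed that was a header. If so, remove the first row. With the data I added to the answer, it works.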
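The factor problem raised in this discussion can be reproduced and worked around with a small sketch. The data frame `rules_df` and its column name `rules` are assumptions made for illustration; the asker's actual object names are not given in the thread.

```r
# Hypothetical one-column data frame mimicking the asker's setup.
# stringsAsFactors = TRUE forces the column to be a factor, which is
# what triggers the "non-character argument" error in strsplit().
rules_df <- data.frame(
  rules = c("{Denny Frying Pan} => {Denny C-Size Batteries}",
            "{Denny Scented Tissue} => {Denny Paper Plates}"),
  stringsAsFactors = TRUE
)

# Convert the factor column to a character vector before splitting.
RULES <- as.character(rules_df$rules)

# Split on the literal separator, then strip the remaining outer braces.
parts <- do.call(rbind, strsplit(RULES, "} => {", fixed = TRUE))
df <- data.frame(LHS = sub("{", "", parts[, 1], fixed = TRUE),
                 RHS = sub("}", "", parts[, 2], fixed = TRUE),
                 stringsAsFactors = FALSE)
```

If a header string such as "RULE" ended up as the first element of the vector, dropping it first (for example with `RULES[-1]`) should avoid the "undefined columns selected" error, since that element contains no separator to split on.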

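【Solution 2】:

You can try one of the following approaches. Both assume you start with a character vector named "rules". If "rules" is already a column in your data.frame, you will need to modify them slightly.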

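library(data.table)       # provides data.table(); attach it explicitly
library(splitstackshape)  # provides cSplit()
library(dplyr)            # provides %>%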

data.table(rules = gsub("[{}]", "", gsub("=>", "\t", rules))) %>%
  cSplit("rules", "\t")
#                             rules_1                         rules_2
#  1:                Denny Frying Pan          Denny C-Size Batteries
#  2:            Denny Scented Tissue              Denny Paper Plates
#  3:   Blue Label Fancy Canned Clams Blue Label Canned Tuna in Water
#  4:             Denny Plastic Forks              Golden Frozen Peas
#  5:                Denny Frying Pan          Denny D-Size Batteries
#  6:             Denny Plastic Forks   Faux Products Apricot Shampoo
#  7:              Golden Frozen Peas             Denny Plastic Forks
#  8:   Faux Products Apricot Shampoo             Denny Plastic Forks
#  9: Blue Label Canned Tuna in Water   Blue Label Fancy Canned Clams
# 10:  Blue Label Canned String Beans  Faux Products Buffered Aspirin
# 11:          Denny D-Size Batteries                Denny Frying Pan

library(dplyr)
library(tidyr)

data.frame(rules) %>%
  mutate(rules = gsub("\\s+=>\\s+", "=>", rules)) %>%
  mutate(rules = gsub("[{}]", "", rules)) %>%
  separate(rules, into = c("V1", "V2"), sep = "=>")
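For the case where the rules already sit in a data.frame column, a base R variant with capture groups may be enough. This is only a sketch: the data frame `rules_df`, the column name `rules`, and the `pattern` variable are hypothetical names, and the pattern assumes every row has exactly the form `{...} => {...}`.

```r
# Hypothetical data frame whose "rules" column already holds the strings.
rules_df <- data.frame(
  rules = c("{Denny Plastic Forks} => {Golden Frozen Peas}",
            "{Golden Frozen Peas} => {Denny Plastic Forks}"),
  stringsAsFactors = FALSE
)

# One pattern with two capture groups: the text inside each pair of braces.
# [^}]* matches everything up to the next closing brace.
pattern <- "^\\{([^}]*)\\}\\s*=>\\s*\\{([^}]*)\\}$"

# Replace the whole string with the first or second captured group.
rules_df$LHS <- sub(pattern, "\\1", rules_df$rules)
rules_df$RHS <- sub(pattern, "\\2", rules_df$rules)
```

Rows that do not match the pattern are returned unchanged by `sub()`, so it is worth checking for leftover braces afterwards.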

【Discussion】:

【Solution 3】:

Here is one approach using qdapRegex, a package I maintain:

RULES <- c("{Denny Frying Pan} => {Denny C-Size Batteries}",
           "{Denny Scented Tissue} => {Denny Paper Plates}",
           "{Blue Label Fancy Canned Clams} => {Blue Label Canned Tuna in Water}",
           "{Denny Plastic Forks} => {Golden Frozen Peas}",
           "{Denny Frying Pan} => {Denny D-Size Batteries}",
           "{Denny Plastic Forks} => {Faux Products Apricot Shampoo}",
           "{Golden Frozen Peas} => {Denny Plastic Forks}",
           "{Faux Products Apricot Shampoo} => {Denny Plastic Forks}",
           "{Blue Label Canned Tuna in Water} => {Blue Label Fancy Canned Clams}",
           "{Blue Label Canned String Beans} => {Faux Products Buffered Aspirin}",
           "{Denny D-Size Batteries} => {Denny Frying Pan}")

library(qdapRegex)
setNames(do.call(rbind.data.frame, rm_curly(RULES, extract=TRUE)), c("LHS", "RHS"))

##                                LHS                             RHS
## 1                 Denny Frying Pan          Denny C-Size Batteries
## 2             Denny Scented Tissue              Denny Paper Plates
## 3    Blue Label Fancy Canned Clams Blue Label Canned Tuna in Water
## 4              Denny Plastic Forks              Golden Frozen Peas
## 5                 Denny Frying Pan          Denny D-Size Batteries
## 6              Denny Plastic Forks   Faux Products Apricot Shampoo
## 7               Golden Frozen Peas             Denny Plastic Forks
## 8    Faux Products Apricot Shampoo             Denny Plastic Forks
## 9  Blue Label Canned Tuna in Water   Blue Label Fancy Canned Clams
## 10  Blue Label Canned String Beans  Faux Products Buffered Aspirin
## 11          Denny D-Size Batteries                Denny Frying Pan

We extract the content between the curly braces, then use do.call + rbind.data.frame to coerce it to a data.frame.
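The same extract-then-bind idea can be sketched in base R for readers who prefer not to install a package. This is not how qdapRegex is implemented; it is an illustrative equivalent that assumes each rule contains exactly two brace-delimited chunks, and the names `chunks` and `inner` are made up for the example.

```r
RULES <- c("{Denny Frying Pan} => {Denny C-Size Batteries}",
           "{Denny D-Size Batteries} => {Denny Frying Pan}")

# Find every "{...}" chunk in each string; the result is a list with
# one character vector per rule.
chunks <- regmatches(RULES, gregexpr("\\{[^}]*\\}", RULES))

# Drop the surrounding braces from each chunk.
inner <- lapply(chunks, function(x) gsub("[{}]", "", x))

# Stack the per-rule vectors into rows and name the two columns.
df <- setNames(as.data.frame(do.call(rbind, inner), stringsAsFactors = FALSE),
               c("LHS", "RHS"))
```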

【Discussion】: