Match two character strings by location in R: answers

【Question title】: Match two character strings by location in R
【Posted】: 2020-06-08 13:48:14
【Question description】:

string <- paste(append(rep(" ", 7), append("A", append(rep(" ", 8), append("B", append(rep(" ", 17), "C"))))), collapse = "")
text <- paste(append(rep(" ", 7), append("I love", append(rep(" ", 3), append("chocolate", append(rep(" ", 9), "pudding"))))), collapse = "")

string
[1] "       A        B                 C"
text
[1] "       I love   chocolate         pudding"

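I am trying to match the letters in string with the text in text, so that the letter A corresponds to "I love", B to "chocolate", and C to "pudding". Ideally, I would like A, B, and C in column 1 of a data frame (or tibble), one per row, with the matching text in column 2 of the same row. Any suggestions?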

【Question discussion】:

Are your texts always 4 words long, or does the length vary?

Tags: r

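【Solution 1】:

It is hard to know whether the strings you want to manipulate and then arrange into the columns of a data.frame follow a fixed pattern. For the example you posted, though, I would suggest putting the strings into a list (strings):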

strings <- list(string, text)

Then use lapply() to process each element of strings in turn, which returns a list of results.

res <-lapply(strings, function(x){
  grep(x=trimws(unlist(strsplit(x, "\\s\\s"))), pattern="[[:alpha:]]", value=TRUE)
})

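In the code above, strsplit() splits the string wherever it finds two consecutive whitespace characters (\\s\\s). The result of the split is a list whose single element is a character vector, so you need unlist() before you can pass it to grep(). grep() then keeps only the pieces that contain at least one alphabetic character ([[:alpha:]]), which is what you want; the empty pieces left over from the runs of spaces are dropped.

Then you can use do.call(cbind, list) to bind the elements of the list returned by lapply() into columns. The elements must all have the same length for this to work.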
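To make the individual steps easier to follow, here is a minimal sketch that runs the same pipeline on string alone and keeps each intermediate result. The names pieces, flat, and kept are only illustrative, and the input is rebuilt with strrep() so the snippet runs on its own.

```r
# Rebuild the example input from the question so the sketch is self-contained.
string <- paste0(strrep(" ", 7), "A", strrep(" ", 8), "B", strrep(" ", 17), "C")

# strsplit() always returns a list, even for a single input string.
pieces <- strsplit(string, "\\s\\s")
class(pieces)

# Flatten to a plain character vector and strip leftover padding.
flat <- trimws(unlist(pieces))

# Keep only the pieces that contain a letter; empty pieces are dropped.
kept <- grep("[[:alpha:]]", flat, value = TRUE)
kept
```

Most entries of flat are empty strings produced by the long runs of spaces, which is why the grep() filter is needed.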

do.call(cbind, res)

Result:

> do.call(cbind, res)
     [,1] [,2]       
[1,] "A"  "I love"   
[2,] "B"  "chocolate"
[3,] "C"  "pudding"
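Because cbind() recycles shorter vectors rather than failing when the lengths differ, it may be worth checking the lengths before binding. The following is only a sketch of such a guard, using a hand-written res with the values shown above; the check itself is an addition and not part of the original answer.

```r
# Stand-in for the list returned by lapply() in the answer above.
res <- list(c("A", "B", "C"), c("I love", "chocolate", "pudding"))

# Stop early if the pieces do not all have the same length,
# since cbind() would otherwise recycle the shorter ones.
stopifnot(length(unique(lengths(res))) == 1L)

m <- do.call(cbind, res)
dim(m)
```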

You can then wrap it in as.data.frame(), for example, to get the desired result:

> as.data.frame(do.call(cbind, res), stringsAsFactors = FALSE)
  V1        V2
1  A    I love
2  B chocolate
3  C   pudding

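【Discussion】:

【Solution 2】:

You can use read.fwf, taking the column start positions from gregexpr on string and the width of the last field from nchar.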

read.fwf(file=textConnection(text),
 widths=c(diff(c(1, gregexpr("\\w", string)[[1]])), nchar(text)))[-1]
#         V2                 V3      V4
#1 I love    chocolate          pudding

If the surrounding whitespace should be removed, also use trimws:

trimws(read.fwf(file=textConnection(text),
 widths=c(diff(c(1, gregexpr("\\w", string)[[1]])), nchar(text)))[-1])
#[1] "I love"    "chocolate" "pudding"
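The same position-based idea can also be sketched without read.fwf, by slicing text with substring() and pairing each slice with its letter to get the two-column data frame asked for in the question. This is an illustrative variant, not the answer's own code; the names starts, ends, keys, and result are made up here, and it assumes every label in string is a single word character.

```r
# Rebuild the example inputs so the sketch is self-contained.
string <- paste0(strrep(" ", 7), "A", strrep(" ", 8), "B", strrep(" ", 17), "C")
text   <- paste0(strrep(" ", 7), "I love", strrep(" ", 3), "chocolate",
                 strrep(" ", 9), "pudding")

# Start position of each label in `string`.
starts <- as.integer(gregexpr("\\w", string)[[1]])

# Each field runs up to the character before the next label;
# the last field runs to the end of `text`.
ends <- c(starts[-1] - 1L, nchar(text))

# Labels are the single characters at the start positions.
keys <- substring(string, starts, starts)

# Slice `text` by position and strip the padding.
result <- data.frame(label = keys,
                     text  = trimws(substring(text, starts, ends)),
                     stringsAsFactors = FALSE)
result
```

Like the read.fwf version, this relies purely on character positions, so it only pairs the pieces correctly when each text starts in the same column as its label.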

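【Discussion】:

Do you have any suggestion for the case where C and "pudding" do not start at the same character position? (For example, C at position 23 and pudding starting at position 20, so that C is matched with "ding" and B with "chocolate pud".)
It might be best to ask a new question for that case, since the given solutions cannot easily be adapted to it.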

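【Solution 3】:

Based on your data, I came up with this workaround using the stringr package. It only works for this particular pattern, so if your data are less regular you will need to adjust it.

The output is a data.frame with two columns, one from each of your inputs, with the matching pieces on the same row.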

library(stringr)

string <- paste(append(rep(" ", 7), append("A", append(rep(" ", 8), append("B", append(rep(" ", 17), "C"))))), collapse = "")
text <- paste(append(rep(" ", 7), append("I love", append(rep(" ", 3), append("chocolate", append(rep(" ", 9), "pudding"))))), collapse = "")

string_nospace <- str_replace_all( string, "\\s{1,20}", " " )
string_nospace <- str_trim( string_nospace )
string_nospace <- data.frame( string = t(str_split(string_nospace, "\\s", simplify = TRUE)))

text_nospace <- str_replace_all( text, "\\s{2,20}", "_" )
text_nospace <- str_sub(text_nospace, start = 2)
text_nospace <- data.frame(text = t(str_split(text_nospace, "_", simplify = TRUE)))

df = data.frame(string = string_nospace, 
                text = text_nospace )
df
#>   string      text
#> 1      A    I love
#> 2      B chocolate
#> 3      C   pudding

Created on 2020-06-08 by the reprex package (v0.3.0)

【Discussion】: