Substring from the nth line to the (n+x)th line of a character vector: answers

【Question title】: Substring from nth line to nth+x line of a character vector
【Posted】: 2014-05-10 06:04:41
【Question】:

I have a character vector

string <- "First line\nSecond line\nthird line\n\nFourth line\nFifth line"

created from a poem

1 First line
2 Second line
3 third line

4 Fourth line
5 Fifth line

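I want to extract the substring of the vector that runs from the 3rd to the 5th verse, i.e. from line 3 to line 5, where blank lines are not counted. Every line except the first may be preceded by \n or \n\n. I do not know the content of the lines (of course), nor how many blank lines (\n\n) lie between lines 3 and 5. I would then like to get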

substring <- "third line\n\nFourth line\nFifth line"

which can then be rendered as

3 third line

4 Fourth line
5 Fifth line

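【Comments】:

Can you give a few more examples? The way you are counting lines looks odd. So \n does not mean a new line to you? Do you want the 3rd, 4th and 5th non-empty lines?

Tags: string r substring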

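【Solution 1】:

Using strsplit, we split the string into groups at each run of two or more \n. We then remove everything in the first group up to its last \n, which leaves only its last line, and paste that together with the second group: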

groups <- strsplit(string, "\n\n+")[[1]]
paste(sub(".*\n", "", groups[1]), groups[2], sep = "\n\n")

This gives:

[1] "third line\n\nFourth line\nFifth line"

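Note that the above always puts exactly two \n between the last line of the first group and the first line of the second group, even if the original had more. If preserving the number of \n matters, extract the separators into seps and pick the first one that has more than 1 character, then use it in the final paste: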

seps <- strsplit(string, "[^\n]+")[[1]]
sep <- seps[nchar(seps) > 1][1] # 1st multiple \n separator

groups <- strsplit(string, "\n\n+")[[1]]
paste(sub(".*\n", "", groups[1]), groups[2], sep = sep)

Revised: added the note above and made minor improvements.
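The approach above is tied to this particular string (last line of the first group plus the second group). As a rough sketch of how the same "keep the original separators" idea might be generalised to arbitrary line numbers, one could locate the non-empty lines with gregexpr and cut the string between them. The helper name extract_lines and its arguments are hypothetical, not part of the original answer:

```r
# Hypothetical helper (not from the original answer): return non-empty
# lines `from`..`to` of `s`, keeping whatever separators sit between them.
extract_lines <- function(s, from, to) {
  starts <- gregexpr("[^\n]+", s)[[1]]      # start of each non-empty line
  lens <- attr(starts, "match.length")      # length of each non-empty line
  stopifnot(starts[1] != -1L, from >= 1, from <= to, to <= length(starts))
  substr(s, starts[from], starts[to] + lens[to] - 1L)
}

string <- "First line\nSecond line\nthird line\n\nFourth line\nFifth line"
extract_lines(string, 3, 5)
# [1] "third line\n\nFourth line\nFifth line"
```

Because the result is cut directly out of the input, any number of blank lines between the selected lines is preserved as-is.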

【Discussion】:

【Solution 2】:

OK, I added some more tests and starred the lines that I think should be included:

1:-----  
    First line
    Second line
    third line (*)
    <blank>
    Fourth line (*)
    Fifth line (*)
2:-----
    <blank>
    <blank>
    aaaa
    bbbbb
    ccccc (*)
    dddddd (*)
    eeeeee (*)
    ffffff
    <blank>
3:-----
    11111
    <blank>
    222222
    <blank>
    333333 (*)
    <blank>
    4444444 (*)
    <blank>
    555555 (*)

If so, then I think this should find them all:

tests <- c(
    "First line\nSecond line\nthird line\n\nFourth line\nFifth line",
    "\n\naaaa\nbbbbb\nccccc\ndddddd\neeeeee\nffffff\n",
    "11111\n\n222222\n\n333333\n\n4444444\n\n555555"
)
# Skip leading blank lines and the first two non-empty lines, capture
# non-empty lines 3 to 5 with their separators, then drop the rest.
gsub("^\\n*[^\\n]+\\n+[^\\n]+\\n+([^\\n]+\\n+[^\\n]+\\n+[^\\n]+)[\\s\\S]*", "\\1", tests, perl = TRUE)
#[1] "third line\n\nFourth line\nFifth line"
#[2] "ccccc\ndddddd\neeeeee"
#[3] "333333\n\n4444444\n\n555555"
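The pattern above hard-codes lines 3 to 5. As a sketch only, the same regex could be assembled for any range of non-empty lines; line_range_pattern is a hypothetical helper name, and strrep assumes R 3.3.0 or later:

```r
# Hypothetical helper: build the PCRE pattern that captures non-empty
# lines `from`..`to`, following the structure of the hand-written regex.
line_range_pattern <- function(from, to) {
  skip <- strrep("[^\\n]+\\n+", from - 1)                   # lines before `from`
  keep <- paste(rep("[^\\n]+", to - from + 1), collapse = "\\n+")
  paste0("^\\n*", skip, "(", keep, ")[\\s\\S]*")
}

tests <- c(
  "First line\nSecond line\nthird line\n\nFourth line\nFifth line",
  "\n\naaaa\nbbbbb\nccccc\ndddddd\neeeeee\nffffff\n",
  "11111\n\n222222\n\n333333\n\n4444444\n\n555555"
)
result <- sub(line_range_pattern(3, 5), "\\1", tests, perl = TRUE)
result
```

For from = 3 and to = 5 this produces exactly the pattern used above. One caveat: if a string has fewer than `to` non-empty lines the pattern does not match, and sub returns that string unchanged.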

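【Discussion】:

【Solution 3】:

You can use gsub to remove everything up to the end of the second line, which leaves the third line through the end of the string.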

> gsub('^.*Second line\n', '', string)
[1] "third line\n\nFourth line\nFifth line"

Or use strsplit in the same way.

> strsplit(string, '^.*Second line\n')[[1]][2]
[1] "third line\n\nFourth line\nFifth line"

Alternatively, readLines can also do the job.

> x <- readLines(textConnection(string))
> gg <- grep('third|fifth', x, ignore.case = TRUE)
> x[gg[1]:gg[2]]
[1] "third line"  ""            "Fourth line" "Fifth line"
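The snippets above match on the text of the lines, which the question says is not known in advance. A possible variation, sketched here under the assumption that "line n" means the nth non-empty line, is to select by position instead. It uses strsplit rather than readLines purely to avoid opening a connection:

```r
string <- "First line\nSecond line\nthird line\n\nFourth line\nFifth line"

x <- strsplit(string, "\n", fixed = TRUE)[[1]]   # one element per line, "" for blanks
nonempty <- which(nzchar(x))                     # indices of the non-empty lines
picked <- x[nonempty[3]:nonempty[5]]             # 3rd to 5th non-empty line, blanks between kept
result <- paste(picked, collapse = "\n")
result
# [1] "third line\n\nFourth line\nFifth line"
```

Collapsing with "\n" restores the original spacing, since each blank line was kept as an empty element.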

【Discussion】: