Extracting paragraphs within section headings [closed]

【Question title】: Extracting paragraphs within section headings [closed]
【Posted】: 2019-10-23 17:08:50
【Question description】:

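My text (read in via readtext) looks like this:

First Summary of Lorem Ipsum

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

Second Summary of Lorem Ipsum

It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

I want to extract the two sections separately, without the section headings, and save them as two different strings in R so that I can convert them back into separate .txt files.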

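【Question discussion】:

Any effort so far? What are the rules for a string to be a valid heading?
How do you identify headings and paragraphs? Can a heading be followed by more than one paragraph? If the layout is constant, you could simply split your document on (?:\r\n|[\r\n])[ \t]*(?:\r\n|[\r\n]) and extract every other result (positions 0, 2, 4, 6, ... in the array).
This question has already been asked several times on SO. For example, here, here and here.
@MitchPudil How do you identify the headings? We don't know your problem the way you do, so when you haven't specified the format, the information we need to answer, and the problem you're running into, it's hard to tell what you need.
@MitchPudil That doesn't help me identify the headings. There has to be some kind of rule, or a list variable containing all the headings, for us to identify them. Right now, the only ways I can really say a heading is identified are that it's the 0th and 2nd sentence in the text you posted, or that a heading doesn't end in a . while a paragraph does. Regex is a set of rules, but we can't help you because only you know the format you need. Without rules that it must adhere to, we can't even begin to write a proper regex pattern.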
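
The split-and-take-every-other-element idea from the comments can be sketched roughly as follows. This is only an illustration under the assumption that every heading is followed by exactly one paragraph and that blank lines separate them; the sample string is made up. Note that with R's 1-based indexing the paragraphs land at the even positions (2, 4, ...) and the headings at the odd ones.

```r
# Hypothetical two-section input; headings and paragraphs are separated by blank lines.
s <- "Heading one\n\nFirst paragraph.\n\nHeading two\n\nSecond paragraph."

# Split on a blank line: two line breaks, optionally with spaces or tabs between them.
parts <- unlist(strsplit(s, "(?:\\r\\n|[\\r\\n])[ \\t]*(?:\\r\\n|[\\r\\n])", perl = TRUE))

# Keep every other element, starting at the second one (the paragraphs).
paragraphs <- parts[seq(2, length(parts), by = 2)]
print(paragraphs)
```

If a heading can be followed by several paragraphs, the positions no longer alternate and this approach would pick up the wrong elements.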

Tags: r regex

【Solution 1】:

You can split the string with a regex (using strsplit) and then use setdiff to remove the elements that titles and the result of strsplit have in common.

See code in use here

# Known section titles to remove from the split result
titles <- c("First Summary of Lorem Ipsum", "Second Summary of Lorem Ipsum")

s <- "First Summary of Lorem Ipsum

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

Second Summary of Lorem Ipsum

It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."

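# Split on blank lines: two line breaks with optional horizontal whitespace around them
a <- unlist(strsplit(s, "\\h*\\R\\h*\\R\\h*", perl = TRUE))
# Drop the elements that are section titles, keeping only the paragraphs
setdiff(a, titles)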

The above results in:

[1] "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book."                                                                                   
[2] "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."
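
Since the question also asks to turn each section back into its own .txt file, one possible way to do that is sketched below. The `paragraphs` vector is a stand-in for the result of `setdiff(a, titles)`, and the output directory and the `section_N.txt` file names are assumptions made for illustration, not something given in the question.

```r
# Stand-in for the character vector returned by setdiff(a, titles).
paragraphs <- c("First paragraph text.", "Second paragraph text.")

# Hypothetical output location and file names (section_1.txt, section_2.txt, ...).
out_dir <- tempdir()
paths <- file.path(out_dir, sprintf("section_%d.txt", seq_along(paragraphs)))

# Write each paragraph to its own file.
for (i in seq_along(paragraphs)) {
  writeLines(paragraphs[i], paths[i])
}
```

Replace `tempdir()` with a real directory to keep the files after the R session ends.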

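Explanation of the regex \\h*\\R\\h*\\R\\h* used above. For simplicity, I've dropped the double backslashes below (they are just character escaping in R strings):

\h matches horizontal whitespace
* quantifies the preceding token (\h in the regex above) to match it zero or more times
\R matches any Unicode newline sequence (such as \r\n, \r or \n)

The regex matches two line breaks, with any amount of horizontal whitespace between or around them, in case the input contains something like \r\n\t\r\n.

The non-Perl equivalent is: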
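
To illustrate why the optional whitespace matters, here is a small sketch on a made-up input that uses Windows line endings and has a stray tab on the otherwise blank line. The `messy` string is hypothetical and not taken from the question.

```r
# Hypothetical input: Windows line endings with a stray tab on the "blank" line.
messy <- "Title\r\n\t\r\nBody text."

# The pattern from the answer still finds the paragraph break.
pieces <- unlist(strsplit(messy, "\\h*\\R\\h*\\R\\h*", perl = TRUE))
print(pieces)

# A naive split on two literal newlines finds no match and returns the string whole.
naive <- unlist(strsplit(messy, "\n\n", fixed = TRUE))
print(length(naive))
```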

[ \\t]*(?:\\r\\n|[\\r\\n])[ \\t]*(?:\\r\\n|[\\r\\n])[ \\t]*
