How do I split a file into multiple files based on a RegEx pattern? Answers

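【Question title】: How do I split a file into multiple files based on a RegEx pattern?
【Posted】: 2021-03-26 17:40:35
【Question description】:

I want to split a file into multiple files based on a specific regex pattern. I have provided a reproducible example below. If there is a simpler solution, I am open to that too!

I have a directory containing the following files: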

page1.html page2.html page3.html

Suppose my page1.html looks like this:

<strong>Hello world</strong>

<p>ABC, Page (1 whatever).</p>
<p>Some text</p>

<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>

I want to split page1.html into:

page1_0.html

<strong>Hello world</strong>

page1_1.html

<p>ABC, Page (1 whatever).</p>
<p>Some text</p>

page1_2.html

<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>

I want code that recognizes lines with the following pattern:

[0 to 10 characters in the beginning] , Page (1 [0 to 10 characters here]). </p>

I currently have the following code:

for filename in *.html; gcsplit -z -f "${filename%.*}_" --suffix-format="%d.html" $filename /'Page (1'/ '{*}'

But this creates a page1_3.html containing the following text:

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>

But when I run this:

for filename in *.html; gcsplit -z -f "${filename%.*}_" --suffix-format="%d.html" $filename /'^.{0,10}, Page \(1.{0,10}\).\<\/p\>'/ '{*}'

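it only outputs the file page1_0.html.

What is wrong with my regex? Is there another way to achieve what I am trying to do?

【Question comments】:

Tags: html regex zsh csplit

【Solution 1】:

You can do this with this short Perl script.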

#chunker.pl
use 5.022;
use strict;
use diagnostics;
use B "perlstring";

our $i = 0;
our $fmt = "page1_%d.html";
our $fn = sprintf $fmt, $i;

open our $fh, ">", $fn or die $!;
print "opened $fn\n";
while (<<>>) {
  printf "read line $.: %s\n", perlstring $_;
  if (m{^.{0,10}?, Page \(1 [^)]{0,10}?\)\.</p>}) {
    print "break matched line $.\n";
    $fn = sprintf $fmt, ++$i;
    open $fh, ">", $fn or die $!;
    print "opened $fn\n";
  }
  print $fh $_;
}

This prints:

$ perl chunker.pl page1.html

opened page1_0.html
read line 1: "<strong>Hello world</strong>\n"
read line 2: "\n"
read line 3: "<p>ABC, Page (1 whatever).</p>\n"
break matched line 3
opened page1_1.html
read line 4: "<p>Some text</p>\n"
read line 5: "\n"
read line 6: "<p>DEF, Page (1 ummm what).</p>\n"
break matched line 6
opened page1_2.html
read line 7: "<p>Some text</p>\n"
read line 8: "\n"
read line 9: "<p>THE<em><strong><span class=\"underline\">GHI</span></strong></em>JK <em><strong><span class=\"underline\">the</span></strong></em>LMNOP<em><strong><span class=\"underline\">Q</span></strong></em>RS.<p> ABC, Page (1).</p>\n"
read line 10: "\n"
read line 11: "\n"



$ for f in page1_*.html; do echo "$f:"; cat $f; echo; done;
page1_0.html:
<strong>Hello world</strong>


page1_1.html:
<p>ABC, Page (1 whatever).</p>
<p>Some text</p>


page1_2.html:
<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>

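I think the problem with your regex is that you need non-greedy matching.

.{0,10}?      minimally zero to ten of any character
, Page \(1    the literal text ", Page (1 "
[^)]{0,10}?   minimally zero to ten non-right-parentheses
\)\.</p>      then a right parenthesis, a literal dot, and </p>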

HTH

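【Comments】:

I ran the Perl script, and the output consisted of page1_0.html (the header) and page1_1.html (the rest of the page).
I just retried it with the exact code from the answer and your sample text, and I got the same output. Did you try it with the exact sample text from the question?
Never mind. I tried it again and it worked. I don't know what happened the first time. Sorry about that.
You could also condense this into a one-liner to run it on the command line.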

【Solution 2】:

^.{0,10}, Page \(1.{0,10}\).\<\/p\>

What is wrong with my regex?

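That is not a POSIX BRE, which is what csplit expects. In a BRE, intervals are written \{m,n\} and an unescaped ( or ) is a literal parenthesis. Try ^.\{0,10\}, Page (1.\{0,10\}).<\/p>.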
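
One way to compare the two dialects is with grep, which uses BRE by default and ERE with -E. The sketch below is an illustration only: the file name sample.html and the temporary directory are assumptions, and the final dot is written as \. so that it matches a literal dot.

```shell
#!/bin/sh
# Build a sample file with the page1.html content from the question.
dir=$(mktemp -d)
cat > "$dir/sample.html" <<'EOF'
<strong>Hello world</strong>

<p>ABC, Page (1 whatever).</p>
<p>Some text</p>

<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>
EOF

# BRE (grep default): intervals are \{m,n\}; a bare ( or ) is a literal parenthesis.
bre_count=$(grep -c '^.\{0,10\}, Page (1.\{0,10\})\.</p>' "$dir/sample.html")

# ERE (grep -E): intervals are {m,n}; literal parentheses need a backslash.
ere_count=$(grep -E -c '^.{0,10}, Page \(1.{0,10}\)\.</p>' "$dir/sample.html")

echo "BRE matches: $bre_count"
echo "ERE matches: $ere_count"
rm -rf "$dir"
```

For this sample both counts are 2: the ABC and DEF lines match, while the long last line does not, because `, Page` does not occur within its first ten characters.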

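The / is written as \/ only because the pattern is passed inside the /REGEXP/[offset] argument of the csplit tool. You may want to change the last . to \. so that it matches a literal dot.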
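
Putting these suggestions together, the loop from the question might look like the sketch below. It assumes GNU csplit, which some package managers install as gcsplit, and it runs in a scratch directory with a sample page1.html. The variable names `CSPLIT`, `workdir`, `pieces` and `second_piece_first_line` are made up for the illustration.

```shell
#!/bin/sh
# Prefer gcsplit where GNU coreutils carries a g prefix; otherwise assume csplit is GNU.
if command -v gcsplit >/dev/null 2>&1; then CSPLIT=gcsplit; else CSPLIT=csplit; fi

workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > page1.html <<'EOF'
<strong>Hello world</strong>

<p>ABC, Page (1 whatever).</p>
<p>Some text</p>

<p>DEF, Page (1 ummm what).</p>
<p>Some text</p>

<p>THE<em><strong><span class="underline">GHI</span></strong></em>JK <em><strong><span class="underline">the</span></strong></em>LMNOP<em><strong><span class="underline">Q</span></strong></em>RS.<p> ABC, Page (1).</p>
EOF

for filename in *.html; do
  # -s: no byte counts, -z: drop empty pieces, -f: output prefix.
  # The pattern is a BRE; {*} repeats it until the input is exhausted.
  "$CSPLIT" -s -z -f "${filename%.*}_" --suffix-format='%d.html' \
    "$filename" '/^.\{0,10\}, Page (1.\{0,10\})\.<\/p>/' '{*}'
done

# Record what was produced for this sample.
pieces=$(ls page1_*.html | wc -l | tr -d ' ')
second_piece_first_line=$(head -n 1 page1_1.html)
echo "pieces: $pieces"
```

With the sample above this yields page1_0.html, page1_1.html and page1_2.html. If real pages carry more than ten characters before `, Page` or inside the parentheses, the interval bounds would need to be widened accordingly.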

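【Comments】:

I used ^.\{0,10\}, Page (1.\{0,10\})\.<\/p> and got a page1_0.html file (only the hello world header) and a page1_1.html (with the rest of the page).
What is going on with [^>]*[^<]*?
After switching to ^.\{0,50\}Page (1, I was more or less able to achieve what I wanted.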