How to extract the text between two patterns using REGEX perl (answers)

【Question title】: How to extract the text between two patterns using REGEX perl
【Posted】: 2011-06-04 16:33:36
【Question】:

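In the lines below, how can I use a Perl regex to store the lines between "Description:" and "Tag:" in a variable, and what would be a good data type to use: a string, a list, or something else?

(I am trying to write a Perl program that extracts information from a text file containing Debian package information and converts it into an RDF (OWL) file, i.e. an ontology.)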

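Description: Library for decoding ATSC A/52 streams (development)
 liba52 is a free library for decoding ATSC A/52 streams. The A/52 standard is
 used in a variety of applications, including digital television and DVD. It is
 also known as AC-3.
 .
 This package contains the development files.
Homepage: http://liba52.sourceforge.net/
Tag: devel::library, role::devel-lib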

The code I have written so far is:

#!/usr/bin/perl
open(DEB,"Packages");
open(ONT,">>debianmodelling.txt");

$i=0;
while(my $line = <DEB>)
{
    if($line =~ /Package/)
    {
        $line =~ s/Package: //;
        print ONT '  <package rdf:ID="instance'.$i.'">';
        print ONT '    <name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</name>'."\n";
    }
    elsif($line =~ /Priority/)
    {
        $line =~ s/Priority: //;
        print ONT '    <priority rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</priority>'."\n";
    }
    elsif($line =~ /Section/)
    {
        $line =~ s/Section: //;
        print ONT '    <Section rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Section>'."\n";
    }
    elsif($line =~ /Maintainer/)
    {
        $line =~ s/Maintainer: //;
        print ONT '    <maintainer rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</maintainer>'."\n";
    }
    elsif($line =~ /Architecture/)
    {
        $line =~ s/Architecture: //;
        print ONT '    <architecture rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</architecture>'."\n";
    }
    elsif($line =~ /Version/)
    {
        $line =~ s/Version: //;
        print ONT '    <version rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</version>'."\n";
    }
    elsif($line =~ /Provides/)
    {
        $line =~ s/Provides: //;
        print ONT '    <provides rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</provides>'."\n";
    }
    elsif($line =~ /Depends/)
    {
        $line =~ s/Depends: //;
        print ONT '    <depends rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</depends>'."\n";
    }
    elsif($line =~ /Suggests/)
    {
        $line =~ s/Suggests: //;
        print ONT '    <suggests rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</suggests>'."\n";
    }
    elsif($line =~ /Description/)
    {
        $line =~ s/Description: //;
        print ONT '    <Description rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Description>'."\n";
    }
    elsif($line =~ /Tag/)
    {
        $line =~ s/Tag: //;
        print ONT '    <Tag rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Tag>'."\n";
        print ONT '  </Package>'."\n\n";
    }
    $i=$i+1;
}

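【Comments on the question】:

Since the best data type depends entirely on how you intend to use the data, you need to explain your goal in a bit more detail.
@Rob Raisch: I apologize for not putting the question at the beginning. Is it OK now?
@Rob I only need to store it in a variable so that I can copy it to a file.
OK, I just realized that I have really been asking about the start of a large project, so it is not easy to test all the answers in a short time.

Tags: regex perl

【Solution 1】: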

my $desc = "Description:";
my $tag  = "Tag:";

$line =~ /$desc(.*?)$tag/;
my $matched = $1;
print $matched;

Or

my $desc = "Description:";
my $tag  = "Tag:";

my @matched = $line =~ /$desc(.*?)$tag/;
print $matched[0];

Or

my $desc = "Description:";
my $tag  = "Tag:";

(my $matched = $line) =~ s/$desc(.*?)$tag/$1/;
print $matched;

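Addendum

If your Description and Tag may be on different lines, you may need the /s modifier, which lets . match newlines so the text is treated as a single line and \n does not break the match. Example: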

$_=qq{Description:foo 
      more description on 
      new line Tag: some
      tag};
s/Description:(.*?)Tag:/$1/s; # note the trailing /s modifier
print;
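
The addendum above removes the two markers in place. For the stated goal of storing the text in a variable, a capture may be the better fit. The following is a minimal sketch, assuming the whole record is already held in one string; the sub name extract_description is hypothetical, and the sample record is abbreviated from the question. Anchoring both markers with ^ under /m is an added assumption that they only count at the start of a line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between "Description:" and the
# next line that starts with "Tag:", or undef if either marker is missing.
sub extract_description {
    my ($record) = @_;
    # /s lets . match newlines; /m lets ^ match at the start of every line.
    # In list context a failed match yields an empty list, so $desc is undef.
    my ($desc) = $record =~ /^Description:\s*(.*?)\s*^Tag:/ms;
    return $desc;
}

# Abbreviated sample record based on the question's example.
my $record = <<'END';
Description: Library for decoding ATSC A/52 streams (development)
 liba52 is a free library for decoding ATSC A/52 streams.
Homepage: http://liba52.sourceforge.net/
Tag: devel::library, role::devel-lib
END

my $desc = extract_description($record);
print defined $desc ? "$desc\n" : "no description found\n";
```

Note that everything between the two markers is captured, so the Homepage line is part of the result. A plain string is enough if the text is only copied to a file; split /\n/, $desc gives a list of lines if that is needed later.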

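【Comments】:

How exactly would an HTML parser reduce the workload? Sorry for the newbie question; I am completely new to the semantic web world.
Sorry, I thought you were parsing an XML document. I have removed that from my answer. See above. It does look like you are building an XML document, though, so maybe you could use an HTML/XML package after all.
+1. However, the third example ((my $matched = $line) =~ s/$desc(.*?)$tag/$1/;) did not work for me; it merely stripped the contents of $desc and $tag from $line, so $matched contained the rest of the line.
@user001 Yes, I don't know what I was thinking; most of these examples were most likely written while I was traveling. s/$desc(.*?)$tag/$1/ performs a substitution. For it to have any effect, the rest of the line has to be part of the match: s/.*$desc(.*?)$tag.*/$1/ (I think that will work)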

【Solution 2】:

Assuming:

my $example; # holds the example text above

you can:

(my $result=$example)=~s/^.*?\n(Description:)/$1/s; # strip up to first marker

$result=~s/(\nTag:[^\n]*\n).+$/$1/s; # strip everything after second marker line

or

(my $result=$example)=~s/^.*?\n(Description:.+?Tag:[^\n]*\n).*$/$1/s;

Both assume that the Tag: value is contained on a single line.

If that is not the case, you can try:

(my $result=$example)=~s/
    (                        # start capture
        Description:         # literal 'Description:'
        .+?                  # any chars (non-greedy) up to
        Tag:                 # literal 'Tag:'
        .+?                  # any chars up to
    )
    (?:                      # either
      \n[A-Z][a-z]+\:        #  another tagged value name 
    |                         # or
      $                       #  end of string
    )
/$1/sx;
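
The last variant can also be written as a capture with a lookahead, which leaves the original string untouched. This is only a sketch under a few assumptions: the sub name description_through_tag is hypothetical, field names are taken to match [A-Z][\w-]* (slightly broader than the pattern above, so that hyphenated names also end the Tag value), and the sample record, including its Section line, is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: capture from "Description:" through the end of the
# Tag: value, even when that value wraps onto continuation lines.
sub description_through_tag {
    my ($text) = @_;
    my ($block) = $text =~ /
        (                    # start capture
            ^Description:    # literal 'Description:' at the start of a line
            .+?              # any chars (non-greedy) up to
            ^Tag:            # literal 'Tag:' at the start of a line
            .+?              # the Tag value, as short as possible, until
        )
        (?=                  # the next thing is either
            \n[A-Z][\w-]*:   #   another field name at the start of a line
          |                  # or
            \s*\z            #   the end of the string
        )
    /msx;
    return $block;           # undef if the pattern did not match
}

# Made-up sample record; the Tag value wraps onto a second line.
my $sample = <<'END';
Package: casper
Description: Run a "live" preinstalled system from read-only media
Tag: admin::boot, admin::filesystem,
 implemented-in::shell
Section: admin
END

my $block = description_through_tag($sample);
print defined $block ? "$block\n" : "no match\n";
```

The lookahead checks what follows without consuming it, so the next field name is not included in the captured text.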

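【Comments】:

【Solution 3】:

I think the problem is caused by using a line-reading loop on data that is structured in paragraphs. If you can slurp the file into memory and apply split with a capturing delimiter, the processing becomes much smoother: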

#!/usr/bin/perl -w

use strict;
use diagnostics;
use warnings;

use English;

# simple sample sub
my $printhead = sub {
  printf "%5s got the tag '%s ...'\n", '', substr( shift, 0, 30 );
};
# map keys/tags? to functions
my %tagsoups = (
    'PackageName' => sub {printf "%5s got the name '%s'\n", '', shift;}
  , 'Description' => sub {printf "%5s got the description:\n---------\n%s\n----------\n", '', shift;}
  , 'Tag'         => $printhead
);
# slurp Packages (fallback: parse using $INPUT_RECORD_SEPARATOR = "Package:")
open my $fh, "<", './Packages-00.txt' or die $!;
local $/; # enable localized slurp mode
my $all = <$fh>;
my @pks = split /^(Package):\s+/ms, $all;
close $fh;
# outer loop: Packages
for (my $p = 1, my $n = 0; $p < scalar @pks; $p +=2) {
  my $blk = "PackageName: " . $pks[$p + 1];
  my @inf = split /\s*^([\w-]+):\s+/ms, $blk;
  printf "%3d %s named %s\n", ++$n, $pks[$p], $inf[ 2 ];
  # inner loop: key-value pairs (or whatever they are called)
  for (my $x = 1; $x < scalar @inf; $x += 2) {
      if (exists($tagsoups{$inf[ $x ]})) {
          $tagsoups{$inf[ $x ]}($inf[$x + 1]);
      }
  }
}

Output for a shortened Packages file from my Ubuntu Linux:

  3 Package named abrowser-3.5-branding
      got the PackageName:
---------
abrowser-3.5-branding
----------
      got the Description:
---------
dummy upgrade package for firefox-3.5 -> firefox
 This is a transitional package so firefox-3.5 users get firefox on
 upgrades. It can be safely removed.
----------
  4 Package named casper
      got the PackageName:
---------
casper
----------
      got the Description:
---------
Run a "live" preinstalled system from read-only media
----------
      got the Tag:
---------
admin::boot, admin::filesystem, implemented-in::shell, protocol::smb, role::plugin, scope::utility, special::completely-tagged, works-with-format::iso9660
----------

Using a hash of functions to apply a function to each extracted part keeps the details of generating the XML out of the parser loop.
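
To illustrate that last point, here is a rough sketch of a dispatch table whose handlers produce the XML, so the parsing loop never touches markup. Several things here are assumptions rather than part of the answer: the helper names (xml_escape, element, stanza_to_xml), the use of paragraph mode ($/ = "") instead of slurping, the element names (modelled on the question's code, with a consistently lower-case package element), and the in-memory sample data.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build a handler that wraps a field value in the given element.
sub element {
    my ($name) = @_;
    return sub {
        my ($value) = @_;
        return sprintf qq{    <%s rdf:datatype="http://www.w3.org/2001/XMLSchema#string">%s</%s>\n},
            $name, xml_escape($value), $name;
    };
}

# Dispatch table: field name => handler (element names are illustrative).
my %handler = (
    Package     => element('name'),
    Description => element('Description'),
    Tag         => element('Tag'),
);

# Turn one stanza into a <package> element; fields without a handler are skipped.
sub stanza_to_xml {
    my ($stanza, $id) = @_;
    # The capturing group makes split return ('', name, value, name, value, ...);
    # the limit -1 keeps a trailing empty value so the pairs stay aligned.
    my (undef, @pairs) = split /^([\w-]+):[ \t]*/m, $stanza, -1;
    my $xml = qq{  <package rdf:ID="instance$id">\n};
    while (my ($field, $value) = splice @pairs, 0, 2) {
        next unless $handler{$field};
        $value =~ s/\s+\z//;    # drop trailing newlines
        $xml .= $handler{$field}->($value);
    }
    return $xml . "  </package>\n";
}

# In-memory sample standing in for the Packages file.
my $sample = <<'END';
Package: casper
Description: Run a "live" preinstalled system from read-only media
Tag: admin::boot, admin::filesystem

Package: abrowser-3.5-branding
Description: dummy upgrade package for firefox-3.5 -> firefox
 This is a transitional package so firefox-3.5 users get firefox on
 upgrades. It can be safely removed.
END

open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
{
    local $/ = "";    # paragraph mode: one blank-line-separated stanza per read
    my $id = 0;
    while (my $stanza = <$fh>) {
        print stanza_to_xml($stanza, $id++);
    }
}
close $fh;
```

Supporting another field then only means adding one entry to %handler; the loop in stanza_to_xml stays unchanged.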

【Comments】: