【Question title】: Perl - Extract string from text block
【Posted】: 2020-10-20 06:14:03
【Question】:

I have some text like this stored in a variable:

<ACCEPTANCE-DATETIME>20201014084217
ACCESSION NUMBER:       0001225208-20-012454
CONFORMED SUBMISSION TYPE:  4
PUBLIC DOCUMENT COUNT:      1
CONFORMED PERIOD OF REPORT: 20201012
FILED AS OF DATE:       20201014
DATE AS OF CHANGE:      20201014

What I need is to search the string and extract this line:

ACCESSION NUMBER:       0001225208-20-012454

More specifically, the number 0001225208-20-012454, without the dashes.

I can't seem to find the right syntax:

my $access_no = $txt =~ /ACCESSION NUMBER/m;

This doesn't work.

【Comments】:

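  • A regex capture group () together with list context on the left-hand side of the assignment does the job: my ($access_no) = $txt =~ /ACCESSION NUMBER: \s+ (\S+) /x; (\s can be replaced with \h to match horizontal whitespace only)
  • Or, with a slightly different approach: my $access_no = $1 if $txt =~ /ACCESSION NUMBER:\s+(.*)$/m;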
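
The difference between the asker's attempt and the first comment comes down to context. The following is a minimal sketch, assuming a one-line sample in `$txt`; the variable names `$matched` and `$digits` are illustrative and not from the thread.

```perl
use strict;
use warnings;

my $txt = "ACCESSION NUMBER:       0001225208-20-012454\n";

# Scalar context: the match returns true/false, not the captured text.
my $matched = $txt =~ /ACCESSION NUMBER:\s+(\S+)/;

# List context (note the parentheses around the variable): the match
# returns the list of captures, so the first capture lands in $access_no.
my ($access_no) = $txt =~ /ACCESSION NUMBER:\s+(\S+)/;

# Copy the value and delete the dashes from the copy.
(my $digits = $access_no) =~ tr/-//d;

print "matched:   $matched\n";
print "access_no: $access_no\n";
print "digits:    $digits\n";
```

With the sample above, `$matched` holds 1, `$access_no` holds the dashed number, and `$digits` holds 000122520820012454.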

Tags: regex string perl extract matching


【Solution 1】:

There are several ways to do this. One of them is:

my $access_no = ''; 
$access_no = $1 . $2 . $3 if $txt =~ m/ACCESSION NUMBER:\s+(\d+)-(\d+)-(\d+)/;
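
If the lookup is needed in more than one place, the same regex could be wrapped in a small function. This is only a sketch: the name `accession_digits` is hypothetical, and anchoring the pattern to the start of a line with `^` and `/m` is an added assumption that the field always begins its line, as it does in the sample.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the accession number without dashes,
# or undef (in scalar context) when the text has no such line.
sub accession_digits {
    my ($txt) = @_;
    return unless defined $txt;

    # Three digit groups separated by dashes, captured separately.
    if ( $txt =~ /^ACCESSION NUMBER:\s+(\d+)-(\d+)-(\d+)/m ) {
        return "$1$2$3";
    }
    return;
}

my $sample = "ACCESSION NUMBER:       0001225208-20-012454\n";
my $found  = accession_digits($sample);
my $none   = accession_digits("FILED AS OF DATE:       20201014\n");

print defined $found ? "found: $found\n" : "not found\n";
print defined $none  ? "found: $none\n"  : "not found\n";
```

Returning undef on a failed match lets the caller tell "no accession number" apart from an empty string.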

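【Comments】:

    【Solution 2】:

    Here is a solution that parses the entire string into a proper data structure (a hash) and then modifies the desired hash element. It is longer than l4chsalter's approach, but it may be easier to maintain and extend in case you also need the remaining fields.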

    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    use feature qw( say );
    
    my $txt = <<'EOF';
    <ACCEPTANCE-DATETIME>20201014084217
    ACCESSION NUMBER:       0001225208-20-012454
    CONFORMED SUBMISSION TYPE:  4
    PUBLIC DOCUMENT COUNT:      1
    CONFORMED PERIOD OF REPORT: 20201012
    FILED AS OF DATE:       20201014
    DATE AS OF CHANGE:      20201014
    EOF
    
    # Parse the entire string into the hash with keys/values:
    my %val = $txt =~ m{ ^ ( [^:\n]+ ): \s+ ( \S+ .*? ) $ }gxms;
    
    # Print the hash with the parsed string (optional):
    # say "'$_' => '$val{$_}'" for keys %val;
    
    # Remove non-digits from the desired element:
    $val{'ACCESSION NUMBER'} =~ tr/0-9//cd;
    
    say $val{'ACCESSION NUMBER'};
    # 000122520820012454
    

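    my %val = $txt =~ m{ ^ ( [^:\n]+ ): \s+ ( \S+ .*? ) $ }gxms; : captures the patterns in parentheses, returns them as a LIST, and assigns that list to the hash %val. Its keys are the field names and its values are the corresponding field values.

    The regex uses these modifiers: /g returns all matches, /x ignores whitespace and comments inside the regex for readability, /m lets ^ and $ match at the start and end of every line, and /s makes . match a newline (optional and not needed here, but I like to use it in complex regexes for maintainability).

    ^ ( [^:\n]+ ): matches, starting at the beginning of a line (^), any character that is not a colon or a newline, repeated 1 or more times, up to the first colon. The parentheses therefore capture the field name.

    ( \S+ .*? ) $ matches one or more non-whitespace characters followed by any character 0 or more times, up to the end of the line ($). The parentheses therefore capture the field value.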
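
Because the regex already uses /x, the explanation above can also live inside the pattern as comments. The sketch below is the same pattern spread over several lines, run against a shortened three-line sample (an assumption made for brevity). Comments inside an /x pattern should avoid `$`, `@`, and unbalanced braces, since Perl still scans them for interpolation and for the closing delimiter.

```perl
use strict;
use warnings;

my $txt = <<'EOF';
<ACCEPTANCE-DATETIME>20201014084217
ACCESSION NUMBER:       0001225208-20-012454
CONFORMED SUBMISSION TYPE:  4
EOF

my %val = $txt =~ m{
    ^             # start of a line (because of /m)
    ( [^:\n]+ )   # capture 1: field name, anything up to the first colon
    :             # the colon itself, not captured
    \s+           # whitespace between name and value
    ( \S+ .*? )   # capture 2: value, starting at a non-whitespace character
    $             # end of the line (because of /m)
}gxms;

# Sorted so the output order is stable; hash order is not.
print "$_ => $val{$_}\n" for sort keys %val;
```

The first sample line contains no colon, so it produces no key; only the two `NAME: value` lines end up in %val.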

    See also:
    perldoc perlre: Perl regular expressions (regexes)
    perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
    perldoc perlrequick: Perl regular expressions quick start

    【Comments】:

      【Solution 3】:

      There are many ways to extract the data of interest and manipulate it.

      Below is the code for a simple approach:
      use strict;
      use warnings;
      use feature 'say';
      
      my $data = do { local $/; <DATA> };
      
      # List-context match; `my ... if ...` has undefined behaviour per perlsyn.
      my ($num) = $data =~ /ACCESSION NUMBER:\s+(.*)$/m;
      
      say 'Extracted: ' . $num;
      $num =~ s/-//g;
      say 'Processed: ' . $num;
      
      __DATA__
      <ACCEPTANCE-DATETIME>20201014084217
      ACCESSION NUMBER:       0001225208-20-012454
      CONFORMED SUBMISSION TYPE:  4
      PUBLIC DOCUMENT COUNT:      1
      CONFORMED PERIOD OF REPORT: 20201012
      FILED AS OF DATE:       20201014
      DATE AS OF CHANGE:      20201014
      

      Output

      Extracted: 0001225208-20-012454
      Processed: 000122520820012454
      

      Now a more involved approach, which extracts the event record into a data structure for further manipulation:

      use strict;
      use warnings;
      use feature 'say';
      
      use Data::Dumper;
      
      my $data = do { local $/; <DATA> };
      
      my %event  = $data =~ /<(.*?)>(.*)$/m;
      my %record = $data =~ /(.*?):\s+(.*)$/gm;
      
      $event{record} = \%record;
      
      say '--- Read record '        . '-' x 29;
      say $data;
      say '--- Content of %record ' . '-' x 22;
      say Dumper(\%record);
      say '--- Content of %event '  . '-' x 23;
      say Dumper(\%event);
      say '-' x 45;
      
      my $num0 = $record{'ACCESSION NUMBER'};
      my $num1 = $num0;
      my $num2 = $num0;
      my $num3 = $num0;
      my @parts = split '-', $num0;
      
      $num1 =~ s/-//g;
      $num2 =~ s/\D//g;
      $num3 =~ tr/-//d;
      
      say '$num0 = ' . $num0;
      say '$num1 = ' . $num1;
      say '$num2 = ' . $num2;
      say '$num3 = ' . $num3;
      say '@parts = ' . join '', @parts;
      
      __DATA__
      <ACCEPTANCE-DATETIME>20201014084217
      ACCESSION NUMBER:       0001225208-20-012454
      CONFORMED SUBMISSION TYPE:  4
      PUBLIC DOCUMENT COUNT:      1
      CONFORMED PERIOD OF REPORT: 20201012
      FILED AS OF DATE:       20201014
      DATE AS OF CHANGE:      20201014
      

      Output

      --- Read record -----------------------------
      <ACCEPTANCE-DATETIME>20201014084217
      ACCESSION NUMBER:       0001225208-20-012454
      CONFORMED SUBMISSION TYPE:  4
      PUBLIC DOCUMENT COUNT:      1
      CONFORMED PERIOD OF REPORT: 20201012
      FILED AS OF DATE:       20201014
      DATE AS OF CHANGE:      20201014
      --- Content of %record ----------------------
      $VAR1 = {
                'FILED AS OF DATE' => '20201014',
                'CONFORMED PERIOD OF REPORT' => '20201012',
                'CONFORMED SUBMISSION TYPE' => '4',
                'ACCESSION NUMBER' => '0001225208-20-012454',
                'PUBLIC DOCUMENT COUNT' => '1',
                'DATE AS OF CHANGE' => '20201014'
              };
      
      --- Content of %event -----------------------
      $VAR1 = {
                'record' => {
                              'FILED AS OF DATE' => '20201014',
                              'CONFORMED PERIOD OF REPORT' => '20201012',
                              'CONFORMED SUBMISSION TYPE' => '4',
                              'ACCESSION NUMBER' => '0001225208-20-012454',
                              'PUBLIC DOCUMENT COUNT' => '1',
                              'DATE AS OF CHANGE' => '20201014'
                            },
                'ACCEPTANCE-DATETIME' => '20201014084217'
              };
      
      ---------------------------------------------
      $num0 = 0001225208-20-012454
      $num1 = 000122520820012454
      $num2 = 000122520820012454
      $num3 = 000122520820012454
      @parts = 000122520820012454
      

      【Comments】:
